This is a project where my group tried to model Airbnb prices in Mexico city using multivariate regressions.

listings <- vroom("http://data.insideairbnb.com/mexico/df/mexico-city/2020-06-20/data/listings.csv.gz")

Data Wrangling & Exploratory Data Analysis

Looking at raw values

glimpse(listings)
## Rows: 21,824
## Columns: 106
## $ id                                           <dbl> 35797, 56074, 61792, 7...
## $ listing_url                                  <chr> "https://www.airbnb.co...
## $ scrape_id                                    <dbl> 2.02e+13, 2.02e+13, 2....
## $ last_scraped                                 <date> 2020-06-23, 2020-06-2...
## $ name                                         <chr> "Villa Dante", "Great ...
## $ summary                                      <chr> "Dentro de Villa un es...
## $ space                                        <chr> "please go to (URL HID...
## $ description                                  <chr> "Dentro de Villa un es...
## $ experiences_offered                          <chr> "none", "none", "none"...
## $ neighborhood_overview                        <chr> "Centro comercial Sant...
## $ notes                                        <chr> "Si te gustan la tipo ...
## $ transit                                      <chr> "Uber es buena opción ...
## $ access                                       <chr> "Jardin muy Amplio.", ...
## $ interaction                                  <chr> "Cualquier duda contác...
## $ house_rules                                  <chr> "Se renta un  estudio ...
## $ thumbnail_url                                <lgl> NA, NA, NA, NA, NA, NA...
## $ medium_url                                   <lgl> NA, NA, NA, NA, NA, NA...
## $ picture_url                                  <chr> "https://a0.muscache.c...
## $ xl_picture_url                               <lgl> NA, NA, NA, NA, NA, NA...
## $ host_id                                      <dbl> 153786, 265650, 299558...
## $ host_url                                     <chr> "https://www.airbnb.co...
## $ host_name                                    <chr> "Dici", "Maris", "Robe...
## $ host_since                                   <date> 2010-06-28, 2010-10-1...
## $ host_location                                <chr> "Mexico City, Mexico C...
## $ host_about                                   <chr> "Master in visual arts...
## $ host_response_time                           <chr> "N/A", "within an hour...
## $ host_response_rate                           <chr> "N/A", "100%", "100%",...
## $ host_acceptance_rate                         <chr> "N/A", "91%", "67%", "...
## $ host_is_superhost                            <lgl> FALSE, TRUE, FALSE, TR...
## $ host_thumbnail_url                           <chr> "https://a0.muscache.c...
## $ host_picture_url                             <chr> "https://a0.muscache.c...
## $ host_neighbourhood                           <chr> NA, "San Rafael", "Con...
## $ host_listings_count                          <dbl> 2, 2, 1, 4, 3, 3, 4, 2...
## $ host_total_listings_count                    <dbl> 2, 2, 1, 4, 3, 3, 4, 2...
## $ host_verifications                           <chr> "['email', 'phone', 'r...
## $ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE...
## $ host_identity_verified                       <lgl> FALSE, FALSE, FALSE, T...
## $ street                                       <chr> "Mexico City, D.f., Me...
## $ neighbourhood                                <chr> NA, "San Rafael", "Con...
## $ neighbourhood_cleansed                       <chr> "Cuajimalpa de Morelos...
## $ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA, NA, NA...
## $ city                                         <chr> "Mexico City", "Mexico...
## $ state                                        <chr> "D.f.", "DF", "Ciudad ...
## $ zipcode                                      <chr> NA, NA, "06140", "0410...
## $ market                                       <chr> "Mexico City", "Mexico...
## $ smart_location                               <chr> "Mexico City, Mexico",...
## $ country_code                                 <chr> "MX", "MX", "MX", "MX"...
## $ country                                      <chr> "Mexico", "Mexico", "M...
## $ latitude                                     <dbl> 19.4, 19.4, 19.4, 19.4...
## $ longitude                                    <dbl> -99.3, -99.2, -99.2, -...
## $ is_location_exact                            <lgl> FALSE, TRUE, TRUE, TRU...
## $ property_type                                <chr> "Villa", "Condominium"...
## $ room_type                                    <chr> "Entire home/apt", "En...
## $ accommodates                                 <dbl> 2, 3, 2, 2, 2, 2, 14, ...
## $ bathrooms                                    <dbl> 1.0, 1.0, 1.0, 1.0, 1....
## $ bedrooms                                     <dbl> 1, 1, 1, 1, 1, 1, 4, 1...
## $ beds                                         <dbl> 1, 2, 1, 1, 1, 1, 10, ...
## $ bed_type                                     <chr> "Futon", "Real Bed", "...
## $ amenities                                    <chr> "{Wifi,Kitchen,\"Free ...
## $ square_feet                                  <dbl> 32292, 646, 161, NA, 1...
## $ price                                        <chr> "$4,500.00", "$843.00"...
## $ weekly_price                                 <chr> NA, "$4,740.00", NA, "...
## $ monthly_price                                <chr> "$124,995.00", "$15,72...
## $ security_deposit                             <chr> NA, "$2,279.00", "$11,...
## $ cleaning_fee                                 <chr> NA, "$684.00", "$340.0...
## $ guests_included                              <dbl> 1, 2, 2, 2, 1, 1, 6, 1...
## $ extra_people                                 <chr> "$0.00", "$342.00", "$...
## $ minimum_nights                               <dbl> 1, 4, 2, 6, 4, 1, 2, 4...
## $ maximum_nights                               <dbl> 7, 150, 21, 180, 365, ...
## $ minimum_minimum_nights                       <dbl> 1, 4, 2, 6, 4, 1, 2, 4...
## $ maximum_minimum_nights                       <dbl> 1, 4, 2, 6, 4, 1, 2, 4...
## $ minimum_maximum_nights                       <dbl> 7, 1125, 21, 180, 365,...
## $ maximum_maximum_nights                       <dbl> 7, 1125, 21, 180, 365,...
## $ minimum_nights_avg_ntm                       <dbl> 1.0, 4.0, 2.0, 6.0, 4....
## $ maximum_nights_avg_ntm                       <dbl> 7, 1125, 21, 180, 365,...
## $ calendar_updated                             <chr> "35 months ago", "4 we...
## $ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE...
## $ availability_30                              <dbl> 23, 0, 30, 28, 0, 0, 2...
## $ availability_60                              <dbl> 53, 0, 60, 58, 19, 0, ...
## $ availability_90                              <dbl> 83, 0, 90, 88, 49, 18,...
## $ availability_365                             <dbl> 358, 0, 180, 363, 319,...
## $ calendar_last_scraped                        <date> 2020-06-23, 2020-06-2...
## $ number_of_reviews                            <dbl> 0, 60, 52, 102, 10, 0,...
## $ number_of_reviews_ltm                        <dbl> 0, 2, 1, 10, 2, 0, 11,...
## $ first_review                                 <date> NA, 2017-11-18, 2017-...
## $ last_review                                  <date> NA, 2019-07-24, 2019-...
## $ review_scores_rating                         <dbl> NA, 97, 98, 98, 100, N...
## $ review_scores_accuracy                       <dbl> NA, 10, 10, 10, 10, NA...
## $ review_scores_cleanliness                    <dbl> NA, 10, 10, 10, 10, NA...
## $ review_scores_checkin                        <dbl> NA, 10, 10, 10, 10, NA...
## $ review_scores_communication                  <dbl> NA, 10, 10, 10, 10, NA...
## $ review_scores_location                       <dbl> NA, 10, 10, 10, 10, NA...
## $ review_scores_value                          <dbl> NA, 10, 10, 10, 10, NA...
## $ requires_license                             <lgl> FALSE, FALSE, FALSE, F...
## $ license                                      <lgl> NA, NA, NA, NA, NA, NA...
## $ jurisdiction_names                           <chr> "{\"Mexico City\",\" M...
## $ instant_bookable                             <lgl> FALSE, TRUE, FALSE, FA...
## $ is_business_travel_ready                     <lgl> FALSE, FALSE, FALSE, F...
## $ cancellation_policy                          <chr> "flexible", "moderate"...
## $ require_guest_profile_picture                <lgl> FALSE, FALSE, FALSE, F...
## $ require_guest_phone_verification             <lgl> FALSE, FALSE, FALSE, F...
## $ calculated_host_listings_count               <dbl> 1, 2, 2, 3, 2, 3, 4, 2...
## $ calculated_host_listings_count_entire_homes  <dbl> 1, 2, 0, 2, 2, 1, 2, 0...
## $ calculated_host_listings_count_private_rooms <dbl> 0, 0, 2, 1, 0, 2, 2, 2...
## $ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0...
## $ reviews_per_month                            <dbl> NA, 1.89, 1.62, 1.00, ...

We have 21824 rows of data and 106 columns, i.e. variables. We can see the following data types: - Strings () - Numbers () - Dates () - Logicals () - List of Strings

Summary Statistics and Variables of interest

skim(listings)
Data summary
Name listings
Number of rows 21824
Number of columns 106
_______________________
Column type frequency:
character 47
Date 5
logical 15
numeric 39
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 34 37 0 21824 0
name 8 1.00 1 255 0 21094 0
summary 1443 0.93 1 1000 0 18458 0
space 6009 0.72 1 1000 0 14005 0
description 1141 0.95 1 1000 0 19519 0
experiences_offered 0 1.00 4 4 0 1 0
neighborhood_overview 6312 0.71 1 1000 0 12861 0
notes 13372 0.39 1 1000 0 7097 0
transit 7254 0.67 1 1000 0 12094 0
access 10332 0.53 1 1000 0 9843 0
interaction 7743 0.65 1 1000 0 11412 0
house_rules 9438 0.57 1 1000 0 10357 0
picture_url 0 1.00 35 146 0 21216 0
host_url 0 1.00 38 43 0 13139 0
host_name 0 1.00 1 35 0 4149 0
host_location 75 1.00 1 104 0 625 0
host_about 8718 0.60 1 5443 0 7251 8
host_response_time 0 1.00 3 18 0 5 0
host_response_rate 0 1.00 2 4 0 52 0
host_acceptance_rate 0 1.00 2 4 0 81 0
host_thumbnail_url 0 1.00 55 106 0 13103 0
host_picture_url 0 1.00 57 109 0 13103 0
host_neighbourhood 9299 0.57 3 39 0 171 0
host_verifications 0 1.00 2 161 0 304 0
street 0 1.00 10 166 0 528 0
neighbourhood 4894 0.78 4 33 0 54 0
neighbourhood_cleansed 0 1.00 7 22 0 16 0
city 30 1.00 2 146 0 270 0
state 150 0.99 2 38 0 120 0
zipcode 1172 0.95 4 31 0 865 0
market 25 1.00 8 21 0 5 0
smart_location 0 1.00 6 154 0 291 0
country_code 0 1.00 2 2 0 1 0
country 0 1.00 6 6 0 1 0
property_type 0 1.00 3 23 0 37 0
room_type 0 1.00 10 15 0 4 0
bed_type 4 1.00 5 13 0 5 0
amenities 0 1.00 2 1714 0 20546 0
price 0 1.00 5 11 0 1042 0
weekly_price 20852 0.04 6 11 0 573 0
monthly_price 20873 0.04 7 13 0 603 0
security_deposit 9696 0.56 5 11 0 671 0
cleaning_fee 7439 0.66 5 10 0 524 0
extra_people 0 1.00 5 9 0 392 0
calendar_updated 0 1.00 5 14 0 88 0
jurisdiction_names 331 0.98 18 110 0 17 0
cancellation_policy 1 1.00 8 27 0 5 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2020-06-20 2020-06-26 2020-06-21 7
host_since 0 1.00 2009-02-03 2020-06-16 2016-11-01 2947
calendar_last_scraped 0 1.00 2020-06-20 2020-06-26 2020-06-21 7
first_review 5371 0.75 2011-07-28 2020-06-22 2018-12-04 2113
last_review 5371 0.75 2013-12-21 2020-06-25 2020-03-03 1287

Variable type: logical

skim_variable n_missing complete_rate mean count
thumbnail_url 21824 0 NaN :
medium_url 21824 0 NaN :
xl_picture_url 21824 0 NaN :
host_is_superhost 0 1 0.34 FAL: 14359, TRU: 7465
host_has_profile_pic 0 1 1.00 TRU: 21784, FAL: 40
host_identity_verified 0 1 0.30 FAL: 15266, TRU: 6558
neighbourhood_group_cleansed 21824 0 NaN :
is_location_exact 0 1 0.83 TRU: 18037, FAL: 3787
has_availability 0 1 1.00 TRU: 21824
requires_license 0 1 0.00 FAL: 21824
license 21824 0 NaN :
instant_bookable 0 1 0.57 TRU: 12471, FAL: 9353
is_business_travel_ready 0 1 0.00 FAL: 21824
require_guest_profile_picture 0 1 0.01 FAL: 21665, TRU: 159
require_guest_phone_verification 0 1 0.01 FAL: 21666, TRU: 158

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.84e+07 1.13e+07 3.58e+04 2.00e+07 3.02e+07 3.86e+07 4.39e+07 ▂▃▅▅▇
scrape_id 0 1.00 2.02e+13 0.00e+00 2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 ▁▁▇▁▁
host_id 0 1.00 1.22e+08 9.68e+07 7.36e+03 3.75e+07 1.02e+08 1.94e+08 3.50e+08 ▇▅▃▃▂
host_listings_count 0 1.00 3.04e+01 2.82e+02 0.00e+00 1.00e+00 2.00e+00 5.00e+00 3.33e+03 ▇▁▁▁▁
host_total_listings_count 0 1.00 3.04e+01 2.82e+02 0.00e+00 1.00e+00 2.00e+00 5.00e+00 3.33e+03 ▇▁▁▁▁
latitude 0 1.00 1.94e+01 5.00e-02 1.92e+01 1.94e+01 1.94e+01 1.94e+01 1.96e+01 ▁▁▅▇▁
longitude 0 1.00 -9.92e+01 4.00e-02 -9.93e+01 -9.92e+01 -9.92e+01 -9.92e+01 -9.90e+01 ▁▁▇▁▁
accommodates 0 1.00 3.05e+00 2.21e+00 1.00e+00 2.00e+00 2.00e+00 4.00e+00 5.00e+01 ▇▁▁▁▁
bathrooms 26 1.00 1.40e+00 1.00e+00 0.00e+00 1.00e+00 1.00e+00 1.50e+00 5.00e+01 ▇▁▁▁▁
bedrooms 48 1.00 1.42e+00 1.14e+00 0.00e+00 1.00e+00 1.00e+00 2.00e+00 5.00e+01 ▇▁▁▁▁
beds 274 0.99 1.83e+00 1.75e+00 0.00e+00 1.00e+00 1.00e+00 2.00e+00 5.00e+01 ▇▁▁▁▁
square_feet 21758 0.00 1.07e+03 3.99e+03 0.00e+00 0.00e+00 2.37e+02 8.40e+02 3.23e+04 ▇▁▁▁▁
guests_included 0 1.00 1.62e+00 1.34e+00 1.00e+00 1.00e+00 1.00e+00 2.00e+00 2.60e+01 ▇▁▁▁▁
minimum_nights 0 1.00 4.09e+00 2.24e+01 1.00e+00 1.00e+00 2.00e+00 2.00e+00 1.12e+03 ▇▁▁▁▁
maximum_nights 0 1.00 6.86e+02 7.08e+02 1.00e+00 4.50e+01 1.12e+03 1.12e+03 5.00e+04 ▇▁▁▁▁
minimum_minimum_nights 0 1.00 3.89e+00 1.96e+01 1.00e+00 1.00e+00 2.00e+00 2.00e+00 1.12e+03 ▇▁▁▁▁
maximum_minimum_nights 0 1.00 4.20e+00 2.26e+01 1.00e+00 1.00e+00 2.00e+00 3.00e+00 1.12e+03 ▇▁▁▁▁
minimum_maximum_nights 0 1.00 8.47e+02 6.70e+02 1.00e+00 3.65e+02 1.12e+03 1.12e+03 5.00e+04 ▇▁▁▁▁
maximum_maximum_nights 0 1.00 8.49e+02 6.69e+02 1.00e+00 3.65e+02 1.12e+03 1.12e+03 5.00e+04 ▇▁▁▁▁
minimum_nights_avg_ntm 0 1.00 4.04e+00 2.06e+01 1.00e+00 1.00e+00 2.00e+00 2.00e+00 1.12e+03 ▇▁▁▁▁
maximum_nights_avg_ntm 0 1.00 8.48e+02 6.69e+02 1.00e+00 3.65e+02 1.12e+03 1.12e+03 5.00e+04 ▇▁▁▁▁
availability_30 0 1.00 1.95e+01 1.24e+01 0.00e+00 1.00e+00 2.60e+01 3.00e+01 3.00e+01 ▃▁▁▂▇
availability_60 0 1.00 4.21e+01 2.38e+01 0.00e+00 2.20e+01 5.60e+01 6.00e+01 6.00e+01 ▂▁▁▁▇
availability_90 0 1.00 6.54e+01 3.47e+01 0.00e+00 4.90e+01 8.50e+01 9.00e+01 9.00e+01 ▂▁▁▁▇
availability_365 0 1.00 2.20e+02 1.39e+02 0.00e+00 8.90e+01 2.11e+02 3.61e+02 3.65e+02 ▃▂▃▁▇
number_of_reviews 0 1.00 2.36e+01 4.18e+01 0.00e+00 1.00e+00 6.00e+00 2.80e+01 5.55e+02 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 9.18e+00 1.50e+01 0.00e+00 0.00e+00 2.00e+00 1.20e+01 1.77e+02 ▇▁▁▁▁
review_scores_rating 5616 0.74 9.51e+01 8.64e+00 2.00e+01 9.40e+01 9.70e+01 1.00e+02 1.00e+02 ▁▁▁▁▇
review_scores_accuracy 5632 0.74 9.72e+00 8.50e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_cleanliness 5632 0.74 9.59e+00 9.20e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_checkin 5635 0.74 9.82e+00 7.20e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_communication 5631 0.74 9.78e+00 7.90e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_location 5636 0.74 9.81e+00 6.80e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_value 5637 0.74 9.58e+00 9.00e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
calculated_host_listings_count 0 1.00 6.09e+00 1.60e+01 1.00e+00 1.00e+00 2.00e+00 4.00e+00 1.57e+02 ▇▁▁▁▁
calculated_host_listings_count_entire_homes 0 1.00 4.09e+00 1.55e+01 0.00e+00 0.00e+00 1.00e+00 2.00e+00 1.57e+02 ▇▁▁▁▁
calculated_host_listings_count_private_rooms 0 1.00 1.77e+00 4.20e+00 0.00e+00 0.00e+00 1.00e+00 2.00e+00 4.80e+01 ▇▁▁▁▁
calculated_host_listings_count_shared_rooms 0 1.00 9.00e-02 8.30e-01 0.00e+00 0.00e+00 0.00e+00 0.00e+00 2.00e+01 ▇▁▁▁▁
reviews_per_month 5371 0.75 1.36e+00 1.50e+00 1.00e-02 3.00e-01 8.20e-01 1.92e+00 1.47e+01 ▇▁▁▁▁

The table above provides summary statistics for variables in the listings data frame and helps to improve our understanding of it.

Numeric Variables

The table above has already grouped variables by data type. At first there are 39 numeric variables. However, we also noticed that some variables, such as price, are currently in character form, albeit being potentially more useful in numeric form. Let’s change them to number variables.

listings2 <- listings %>% 
  
  # Convert prices from characters to numeric format ( remove '$' and ',' then convert to number)
  mutate(price = as.numeric(gsub('[$,]', '', price)),
        weekly_price = as.numeric(gsub('[$,]', '', weekly_price)),
        monthly_price = as.numeric(gsub('[$,]', '', monthly_price)), 
        security_deposit = as.numeric(gsub('[$,]', '', security_deposit)),
        cleaning_fee = as.numeric(gsub('[$,]', '', cleaning_fee)),
        extra_people = as.numeric(gsub('[$,]', '', extra_people))) %>%
  
  # Convert percentages from characters to numeric format ( remove '%' then convert to number)
  mutate(host_response_rate = as.numeric(gsub('[%]', '', host_response_rate)) / 100,
         host_acceptance_rate = as.numeric(gsub('[%]', '', host_acceptance_rate)) / 100)

Now we check to ensure that they are numeric variables, and then display the new list of numeric variables.

# Use typeof to check if previously reformatted variables are now numeric variables
typeof(listings2$price)
## [1] "double"
typeof(listings2$weekly_price)
## [1] "double"
typeof(listings2$security_deposit)
## [1] "double"
typeof(listings2$cleaning_fee)
## [1] "double"
typeof(listings2$extra_people)
## [1] "double"
typeof(listings2$host_response_rate)
## [1] "double"
typeof(listings2$host_acceptance_rate)
## [1] "double"
listings2 %>%
  
  # Show only columns with numerical variables
  select_if(is.numeric) %>%
  
  # Use pivot.longer() to list numerical variables in a vertical list, values_to is just arbitrary as this value column will be deselected in the next step
  pivot_longer(c(1:47), names_to = "Numerical_variables", values_to = "value") %>%
  
  # Show column with variables name only
  select(-value) %>%
  
  # Take the top 47 rows only, as the rest are just repeated names
  head(47)

Numerical_variables
id
scrape_id
host_id
host_response_rate
host_acceptance_rate
host_listings_count
host_total_listings_count
latitude
longitude
accommodates
bathrooms
bedrooms
beds
square_feet
price
weekly_price
monthly_price
security_deposit
cleaning_fee
guests_included
extra_people
minimum_nights
maximum_nights
minimum_minimum_nights
maximum_minimum_nights
minimum_maximum_nights
maximum_maximum_nights
minimum_nights_avg_ntm
maximum_nights_avg_ntm
availability_30
availability_60
availability_90
availability_365
number_of_reviews
number_of_reviews_ltm
review_scores_rating
review_scores_accuracy
review_scores_cleanliness
review_scores_checkin
review_scores_communication
review_scores_location
review_scores_value
calculated_host_listings_count
calculated_host_listings_count_entire_homes
calculated_host_listings_count_private_rooms
calculated_host_listings_count_shared_rooms
reviews_per_month
The above list shows all numerical variables, in which price, weekly_price, monthly_price, security_deposit, cleaning_fee, extra_people, host_response_rate, and host_acceptance_rate are originally character variables, but have been converted to numerical format to suit the nature of their data. This confirms that we successfully reformatted these variables.

Factor Variables

We use the skim function again to see the number of unique variables per column. This helps us identify which category might contain factor variables. We can then use the distinct function to display the distinct components of these categories.

skim(listings2)
Data summary
Name listings2
Number of rows 21824
Number of columns 106
_______________________
Column type frequency:
character 39
Date 5
logical 15
numeric 47
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 34 37 0 21824 0
name 8 1.00 1 255 0 21094 0
summary 1443 0.93 1 1000 0 18458 0
space 6009 0.72 1 1000 0 14005 0
description 1141 0.95 1 1000 0 19519 0
experiences_offered 0 1.00 4 4 0 1 0
neighborhood_overview 6312 0.71 1 1000 0 12861 0
notes 13372 0.39 1 1000 0 7097 0
transit 7254 0.67 1 1000 0 12094 0
access 10332 0.53 1 1000 0 9843 0
interaction 7743 0.65 1 1000 0 11412 0
house_rules 9438 0.57 1 1000 0 10357 0
picture_url 0 1.00 35 146 0 21216 0
host_url 0 1.00 38 43 0 13139 0
host_name 0 1.00 1 35 0 4149 0
host_location 75 1.00 1 104 0 625 0
host_about 8718 0.60 1 5443 0 7251 8
host_response_time 0 1.00 3 18 0 5 0
host_thumbnail_url 0 1.00 55 106 0 13103 0
host_picture_url 0 1.00 57 109 0 13103 0
host_neighbourhood 9299 0.57 3 39 0 171 0
host_verifications 0 1.00 2 161 0 304 0
street 0 1.00 10 166 0 528 0
neighbourhood 4894 0.78 4 33 0 54 0
neighbourhood_cleansed 0 1.00 7 22 0 16 0
city 30 1.00 2 146 0 270 0
state 150 0.99 2 38 0 120 0
zipcode 1172 0.95 4 31 0 865 0
market 25 1.00 8 21 0 5 0
smart_location 0 1.00 6 154 0 291 0
country_code 0 1.00 2 2 0 1 0
country 0 1.00 6 6 0 1 0
property_type 0 1.00 3 23 0 37 0
room_type 0 1.00 10 15 0 4 0
bed_type 4 1.00 5 13 0 5 0
amenities 0 1.00 2 1714 0 20546 0
calendar_updated 0 1.00 5 14 0 88 0
jurisdiction_names 331 0.98 18 110 0 17 0
cancellation_policy 1 1.00 8 27 0 5 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2020-06-20 2020-06-26 2020-06-21 7
host_since 0 1.00 2009-02-03 2020-06-16 2016-11-01 2947
calendar_last_scraped 0 1.00 2020-06-20 2020-06-26 2020-06-21 7
first_review 5371 0.75 2011-07-28 2020-06-22 2018-12-04 2113
last_review 5371 0.75 2013-12-21 2020-06-25 2020-03-03 1287

Variable type: logical

skim_variable n_missing complete_rate mean count
thumbnail_url 21824 0 NaN :
medium_url 21824 0 NaN :
xl_picture_url 21824 0 NaN :
host_is_superhost 0 1 0.34 FAL: 14359, TRU: 7465
host_has_profile_pic 0 1 1.00 TRU: 21784, FAL: 40
host_identity_verified 0 1 0.30 FAL: 15266, TRU: 6558
neighbourhood_group_cleansed 21824 0 NaN :
is_location_exact 0 1 0.83 TRU: 18037, FAL: 3787
has_availability 0 1 1.00 TRU: 21824
requires_license 0 1 0.00 FAL: 21824
license 21824 0 NaN :
instant_bookable 0 1 0.57 TRU: 12471, FAL: 9353
is_business_travel_ready 0 1 0.00 FAL: 21824
require_guest_profile_picture 0 1 0.01 FAL: 21665, TRU: 159
require_guest_phone_verification 0 1 0.01 FAL: 21666, TRU: 158

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.84e+07 1.13e+07 3.58e+04 2.00e+07 3.02e+07 3.86e+07 4.39e+07 ▂▃▅▅▇
scrape_id 0 1.00 2.02e+13 0.00e+00 2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 ▁▁▇▁▁
host_id 0 1.00 1.22e+08 9.68e+07 7.36e+03 3.75e+07 1.02e+08 1.94e+08 3.50e+08 ▇▅▃▃▂
host_response_rate 8450 0.61 9.10e-01 2.30e-01 0.00e+00 1.00e+00 1.00e+00 1.00e+00 1.00e+00 ▁▁▁▁▇
host_acceptance_rate 3868 0.82 8.90e-01 2.30e-01 0.00e+00 9.00e-01 9.90e-01 1.00e+00 1.00e+00 ▁▁▁▁▇
host_listings_count 0 1.00 3.04e+01 2.82e+02 0.00e+00 1.00e+00 2.00e+00 5.00e+00 3.33e+03 ▇▁▁▁▁
host_total_listings_count 0 1.00 3.04e+01 2.82e+02 0.00e+00 1.00e+00 2.00e+00 5.00e+00 3.33e+03 ▇▁▁▁▁
latitude 0 1.00 1.94e+01 5.00e-02 1.92e+01 1.94e+01 1.94e+01 1.94e+01 1.96e+01 ▁▁▅▇▁
longitude 0 1.00 -9.92e+01 4.00e-02 -9.93e+01 -9.92e+01 -9.92e+01 -9.92e+01 -9.90e+01 ▁▁▇▁▁
accommodates 0 1.00 3.05e+00 2.21e+00 1.00e+00 2.00e+00 2.00e+00 4.00e+00 5.00e+01 ▇▁▁▁▁
bathrooms 26 1.00 1.40e+00 1.00e+00 0.00e+00 1.00e+00 1.00e+00 1.50e+00 5.00e+01 ▇▁▁▁▁
bedrooms 48 1.00 1.42e+00 1.14e+00 0.00e+00 1.00e+00 1.00e+00 2.00e+00 5.00e+01 ▇▁▁▁▁
beds 274 0.99 1.83e+00 1.75e+00 0.00e+00 1.00e+00 1.00e+00 2.00e+00 5.00e+01 ▇▁▁▁▁
square_feet 21758 0.00 1.07e+03 3.99e+03 0.00e+00 0.00e+00 2.37e+02 8.40e+02 3.23e+04 ▇▁▁▁▁
price 0 1.00 1.48e+03 4.99e+03 0.00e+00 4.08e+02 7.25e+02 1.32e+03 3.50e+05 ▇▁▁▁▁
weekly_price 20852 0.04 1.26e+04 2.88e+04 6.00e+01 2.63e+03 5.21e+03 1.01e+04 3.42e+05 ▇▁▁▁▁
monthly_price 20873 0.04 4.86e+04 1.16e+05 2.00e+02 9.00e+03 1.74e+04 3.50e+04 1.51e+06 ▇▁▁▁▁
security_deposit 9696 0.56 3.28e+03 8.94e+03 0.00e+00 0.00e+00 0.00e+00 3.40e+03 1.15e+05 ▇▁▁▁▁
cleaning_fee 7439 0.66 3.61e+02 5.23e+02 0.00e+00 1.14e+02 3.00e+02 4.53e+02 2.40e+04 ▇▁▁▁▁
guests_included 0 1.00 1.62e+00 1.34e+00 1.00e+00 1.00e+00 1.00e+00 2.00e+00 2.60e+01 ▇▁▁▁▁
extra_people 0 1.00 1.23e+02 2.69e+02 0.00e+00 0.00e+00 0.00e+00 2.00e+02 6.84e+03 ▇▁▁▁▁
minimum_nights 0 1.00 4.09e+00 2.24e+01 1.00e+00 1.00e+00 2.00e+00 2.00e+00 1.12e+03 ▇▁▁▁▁
maximum_nights 0 1.00 6.86e+02 7.08e+02 1.00e+00 4.50e+01 1.12e+03 1.12e+03 5.00e+04 ▇▁▁▁▁
minimum_minimum_nights 0 1.00 3.89e+00 1.96e+01 1.00e+00 1.00e+00 2.00e+00 2.00e+00 1.12e+03 ▇▁▁▁▁
maximum_minimum_nights 0 1.00 4.20e+00 2.26e+01 1.00e+00 1.00e+00 2.00e+00 3.00e+00 1.12e+03 ▇▁▁▁▁
minimum_maximum_nights 0 1.00 8.47e+02 6.70e+02 1.00e+00 3.65e+02 1.12e+03 1.12e+03 5.00e+04 ▇▁▁▁▁
maximum_maximum_nights 0 1.00 8.49e+02 6.69e+02 1.00e+00 3.65e+02 1.12e+03 1.12e+03 5.00e+04 ▇▁▁▁▁
minimum_nights_avg_ntm 0 1.00 4.04e+00 2.06e+01 1.00e+00 1.00e+00 2.00e+00 2.00e+00 1.12e+03 ▇▁▁▁▁
maximum_nights_avg_ntm 0 1.00 8.48e+02 6.69e+02 1.00e+00 3.65e+02 1.12e+03 1.12e+03 5.00e+04 ▇▁▁▁▁
availability_30 0 1.00 1.95e+01 1.24e+01 0.00e+00 1.00e+00 2.60e+01 3.00e+01 3.00e+01 ▃▁▁▂▇
availability_60 0 1.00 4.21e+01 2.38e+01 0.00e+00 2.20e+01 5.60e+01 6.00e+01 6.00e+01 ▂▁▁▁▇
availability_90 0 1.00 6.54e+01 3.47e+01 0.00e+00 4.90e+01 8.50e+01 9.00e+01 9.00e+01 ▂▁▁▁▇
availability_365 0 1.00 2.20e+02 1.39e+02 0.00e+00 8.90e+01 2.11e+02 3.61e+02 3.65e+02 ▃▂▃▁▇
number_of_reviews 0 1.00 2.36e+01 4.18e+01 0.00e+00 1.00e+00 6.00e+00 2.80e+01 5.55e+02 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 9.18e+00 1.50e+01 0.00e+00 0.00e+00 2.00e+00 1.20e+01 1.77e+02 ▇▁▁▁▁
review_scores_rating 5616 0.74 9.51e+01 8.64e+00 2.00e+01 9.40e+01 9.70e+01 1.00e+02 1.00e+02 ▁▁▁▁▇
review_scores_accuracy 5632 0.74 9.72e+00 8.50e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_cleanliness 5632 0.74 9.59e+00 9.20e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_checkin 5635 0.74 9.82e+00 7.20e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_communication 5631 0.74 9.78e+00 7.90e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_location 5636 0.74 9.81e+00 6.80e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_value 5637 0.74 9.58e+00 9.00e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
calculated_host_listings_count 0 1.00 6.09e+00 1.60e+01 1.00e+00 1.00e+00 2.00e+00 4.00e+00 1.57e+02 ▇▁▁▁▁
calculated_host_listings_count_entire_homes 0 1.00 4.09e+00 1.55e+01 0.00e+00 0.00e+00 1.00e+00 2.00e+00 1.57e+02 ▇▁▁▁▁
calculated_host_listings_count_private_rooms 0 1.00 1.77e+00 4.20e+00 0.00e+00 0.00e+00 1.00e+00 2.00e+00 4.80e+01 ▇▁▁▁▁
calculated_host_listings_count_shared_rooms 0 1.00 9.00e-02 8.30e-01 0.00e+00 0.00e+00 0.00e+00 0.00e+00 2.00e+01 ▇▁▁▁▁
reviews_per_month 5371 0.75 1.36e+00 1.50e+00 1.00e-02 3.00e-01 8.20e-01 1.92e+00 1.47e+01 ▇▁▁▁▁
# Use distinct to see distinct components of each variable
listings %>%  distinct(host_response_time)
host_response_time
N/A
within an hour
within a day
within a few hours
a few days or more
listings %>%  distinct(market)
market
Mexico City
Other (International)
Pochutla
Santander
Playa del Carmen
listings %>%  distinct(room_type)
room_type
Entire home/apt
Private room
Hotel room
Shared room
listings %>%  distinct(bed_type)
bed_type
Futon
Real Bed
Couch
Pull-out Sofa
Airbed
listings %>%  distinct(cancellation_policy)
cancellation_policy
flexible
moderate
strict_14_with_grace_period
super_strict_30
super_strict_60
listings %>%  distinct(experiences_offered)
experiences_offered
none
listings %>%  distinct(neighbourhood_cleansed)
neighbourhood_cleansed
Cuajimalpa de Morelos
Cuauhtémoc
Coyoacán
Miguel Hidalgo
Benito Juárez
Azcapotzalco
Iztacalco
Tlalpan
La Magdalena Contreras
Venustiano Carranza
Álvaro Obregón
Gustavo A. Madero
Iztapalapa
Xochimilco
Tláhuac
Milpa Alta

We find the following:

  • Host response time: N/A, Within an hour, WIthin a day, Within a few hours, a few days or more. We cam logically order the ranking of these values and format them as factor.
  • Market: Mexico City, Other (International), Pochutla, Sanstander, Playa del Carmen
  • Room Type: Entire home / apt, Private room, Hotel room, Shared room
  • Bed Type: Futon, Real Bed, Couch, Pull-out Sofa, Airbed. We can create a Boolean variable checking whether a real bed is included.
  • Cancellation Policy: flexible, moderate, strict 14 with grace period, super strict 30, super strict 60. We can logically order the ranking of these values and format them as factor.
  • Experiences offered: the only answer is none, presumably this features doesn’t exist in Mexico City
  • Property type: We will summarise the 30+ categories into Apartment, House, Condominium, Loft and Other (which contains the remaining options)
listings3 <- listings2 %>% 
  
  # Reorder factor variables where it is appropriate to do so
  mutate(host_response_time = factor(host_response_time, c("N/A","within an hour", "within a few hours", "within a day", "a few days or more"), 
                                     levels = c("within an hour", "within a few hours", "within a day", "a few days or more", "N/A")),
         cancellation_policy = factor(cancellation_policy, c("flexible", "moderate", "strict_14_with_grace_period", "super_strict_30", "super_strict_60"), 
                                      levels = c("flexible", "moderate", "strict_14_with_grace_period", "super_strict_30", "super_strict_60")),
         
         # Save 4 most popular property types as individual categories, and the rest as "Other"
         prop_type_simplified = case_when(property_type %in% c("Apartment","House", "Condominium","Loft") ~ property_type, 
                                          TRUE ~ "Other"),
         # Create boolean for whether a real bed is included or not
         real_bed = ifelse(bed_type == "Real Bed", TRUE, FALSE))

We have only formatted some character variables as factors. These are what we expect to be the most useful variables in factor form. The rest can formatted as factors later if needed.

Missing Values & Uniqueness

Cleaning fees and Deposits

We use the skim function again, this time to study the variables cleaning_fee and security_deposit:

skim(listings3$cleaning_fee)
Data summary
Name listings3$cleaning_fee
Number of rows 21824
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 7439 0.66 361 523 0 114 300 453 24000 ▇▁▁▁▁
skim(listings3$security_deposit)
Data summary
Name listings3$security_deposi…
Number of rows 21824
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 9696 0.56 3278 8935 0 0 0 3397 115498 ▇▁▁▁▁

We can see that:

  • Cleaning fee has multiple missing values. We should not exclude these as it means there is no cleaning fee, i.e. it is 0.

  • Similarly, security deposit also has multiple missing values. We should not exclude these as it means we have no deposit requirements, i.e. it is 0

For this reason, we decided to condition them to set a value of zero for all missing entries. .

listings4 <- listings3 %>%

# Set cleaning fees and security deposit to 0 if entry is NA
  mutate(cleaning_fee = case_when(is.na(cleaning_fee) ~ 0, 
                                  TRUE ~ cleaning_fee),
         security_deposit = case_when(is.na(security_deposit) ~ 0, 
                                  TRUE ~ security_deposit))

Other Missing Values

We know skim the entire data frame again to further understand other missing values

skim(listings4)
Data summary
Name listings4
Number of rows 21824
Number of columns 108
_______________________
Column type frequency:
character 38
Date 5
factor 2
logical 16
numeric 47
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 34 37 0 21824 0
name 8 1.00 1 255 0 21094 0
summary 1443 0.93 1 1000 0 18458 0
space 6009 0.72 1 1000 0 14005 0
description 1141 0.95 1 1000 0 19519 0
experiences_offered 0 1.00 4 4 0 1 0
neighborhood_overview 6312 0.71 1 1000 0 12861 0
notes 13372 0.39 1 1000 0 7097 0
transit 7254 0.67 1 1000 0 12094 0
access 10332 0.53 1 1000 0 9843 0
interaction 7743 0.65 1 1000 0 11412 0
house_rules 9438 0.57 1 1000 0 10357 0
picture_url 0 1.00 35 146 0 21216 0
host_url 0 1.00 38 43 0 13139 0
host_name 0 1.00 1 35 0 4149 0
host_location 75 1.00 1 104 0 625 0
host_about 8718 0.60 1 5443 0 7251 8
host_thumbnail_url 0 1.00 55 106 0 13103 0
host_picture_url 0 1.00 57 109 0 13103 0
host_neighbourhood 9299 0.57 3 39 0 171 0
host_verifications 0 1.00 2 161 0 304 0
street 0 1.00 10 166 0 528 0
neighbourhood 4894 0.78 4 33 0 54 0
neighbourhood_cleansed 0 1.00 7 22 0 16 0
city 30 1.00 2 146 0 270 0
state 150 0.99 2 38 0 120 0
zipcode 1172 0.95 4 31 0 865 0
market 25 1.00 8 21 0 5 0
smart_location 0 1.00 6 154 0 291 0
country_code 0 1.00 2 2 0 1 0
country 0 1.00 6 6 0 1 0
property_type 0 1.00 3 23 0 37 0
room_type 0 1.00 10 15 0 4 0
bed_type 4 1.00 5 13 0 5 0
amenities 0 1.00 2 1714 0 20546 0
calendar_updated 0 1.00 5 14 0 88 0
jurisdiction_names 331 0.98 18 110 0 17 0
prop_type_simplified 0 1.00 4 11 0 5 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2020-06-20 2020-06-26 2020-06-21 7
host_since 0 1.00 2009-02-03 2020-06-16 2016-11-01 2947
calendar_last_scraped 0 1.00 2020-06-20 2020-06-26 2020-06-21 7
first_review 5371 0.75 2011-07-28 2020-06-22 2018-12-04 2113
last_review 5371 0.75 2013-12-21 2020-06-25 2020-03-03 1287

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
host_response_time 0 1 FALSE 5 a f: 8450, N/A: 7865, wit: 2876, wit: 1858
cancellation_policy 1 1 FALSE 5 fle: 11310, mod: 6015, str: 4467, sup: 28

Variable type: logical

skim_variable n_missing complete_rate mean count
thumbnail_url 21824 0 NaN :
medium_url 21824 0 NaN :
xl_picture_url 21824 0 NaN :
host_is_superhost 0 1 0.34 FAL: 14359, TRU: 7465
host_has_profile_pic 0 1 1.00 TRU: 21784, FAL: 40
host_identity_verified 0 1 0.30 FAL: 15266, TRU: 6558
neighbourhood_group_cleansed 21824 0 NaN :
is_location_exact 0 1 0.83 TRU: 18037, FAL: 3787
has_availability 0 1 1.00 TRU: 21824
requires_license 0 1 0.00 FAL: 21824
license 21824 0 NaN :
instant_bookable 0 1 0.57 TRU: 12471, FAL: 9353
is_business_travel_ready 0 1 0.00 FAL: 21824
require_guest_profile_picture 0 1 0.01 FAL: 21665, TRU: 159
require_guest_phone_verification 0 1 0.01 FAL: 21666, TRU: 158
real_bed 4 1 0.99 TRU: 21664, FAL: 156

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.84e+07 1.13e+07 3.58e+04 2.00e+07 3.02e+07 3.86e+07 4.39e+07 ▂▃▅▅▇
scrape_id 0 1.00 2.02e+13 0.00e+00 2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 ▁▁▇▁▁
host_id 0 1.00 1.22e+08 9.68e+07 7.36e+03 3.75e+07 1.02e+08 1.94e+08 3.50e+08 ▇▅▃▃▂
host_response_rate 8450 0.61 9.10e-01 2.30e-01 0.00e+00 1.00e+00 1.00e+00 1.00e+00 1.00e+00 ▁▁▁▁▇
host_acceptance_rate 3868 0.82 8.90e-01 2.30e-01 0.00e+00 9.00e-01 9.90e-01 1.00e+00 1.00e+00 ▁▁▁▁▇
host_listings_count 0 1.00 3.04e+01 2.82e+02 0.00e+00 1.00e+00 2.00e+00 5.00e+00 3.33e+03 ▇▁▁▁▁
host_total_listings_count 0 1.00 3.04e+01 2.82e+02 0.00e+00 1.00e+00 2.00e+00 5.00e+00 3.33e+03 ▇▁▁▁▁
latitude 0 1.00 1.94e+01 5.00e-02 1.92e+01 1.94e+01 1.94e+01 1.94e+01 1.96e+01 ▁▁▅▇▁
longitude 0 1.00 -9.92e+01 4.00e-02 -9.93e+01 -9.92e+01 -9.92e+01 -9.92e+01 -9.90e+01 ▁▁▇▁▁
accommodates 0 1.00 3.05e+00 2.21e+00 1.00e+00 2.00e+00 2.00e+00 4.00e+00 5.00e+01 ▇▁▁▁▁
bathrooms 26 1.00 1.40e+00 1.00e+00 0.00e+00 1.00e+00 1.00e+00 1.50e+00 5.00e+01 ▇▁▁▁▁
bedrooms 48 1.00 1.42e+00 1.14e+00 0.00e+00 1.00e+00 1.00e+00 2.00e+00 5.00e+01 ▇▁▁▁▁
beds 274 0.99 1.83e+00 1.75e+00 0.00e+00 1.00e+00 1.00e+00 2.00e+00 5.00e+01 ▇▁▁▁▁
square_feet 21758 0.00 1.07e+03 3.99e+03 0.00e+00 0.00e+00 2.37e+02 8.40e+02 3.23e+04 ▇▁▁▁▁
price 0 1.00 1.48e+03 4.99e+03 0.00e+00 4.08e+02 7.25e+02 1.32e+03 3.50e+05 ▇▁▁▁▁
weekly_price 20852 0.04 1.26e+04 2.88e+04 6.00e+01 2.63e+03 5.21e+03 1.01e+04 3.42e+05 ▇▁▁▁▁
monthly_price 20873 0.04 4.86e+04 1.16e+05 2.00e+02 9.00e+03 1.74e+04 3.50e+04 1.51e+06 ▇▁▁▁▁
security_deposit 0 1.00 1.82e+03 6.86e+03 0.00e+00 0.00e+00 0.00e+00 2.00e+03 1.15e+05 ▇▁▁▁▁
cleaning_fee 0 1.00 2.38e+02 4.58e+02 0.00e+00 0.00e+00 1.12e+02 3.50e+02 2.40e+04 ▇▁▁▁▁
guests_included 0 1.00 1.62e+00 1.34e+00 1.00e+00 1.00e+00 1.00e+00 2.00e+00 2.60e+01 ▇▁▁▁▁
extra_people 0 1.00 1.23e+02 2.69e+02 0.00e+00 0.00e+00 0.00e+00 2.00e+02 6.84e+03 ▇▁▁▁▁
minimum_nights 0 1.00 4.09e+00 2.24e+01 1.00e+00 1.00e+00 2.00e+00 2.00e+00 1.12e+03 ▇▁▁▁▁
maximum_nights 0 1.00 6.86e+02 7.08e+02 1.00e+00 4.50e+01 1.12e+03 1.12e+03 5.00e+04 ▇▁▁▁▁
minimum_minimum_nights 0 1.00 3.89e+00 1.96e+01 1.00e+00 1.00e+00 2.00e+00 2.00e+00 1.12e+03 ▇▁▁▁▁
maximum_minimum_nights 0 1.00 4.20e+00 2.26e+01 1.00e+00 1.00e+00 2.00e+00 3.00e+00 1.12e+03 ▇▁▁▁▁
minimum_maximum_nights 0 1.00 8.47e+02 6.70e+02 1.00e+00 3.65e+02 1.12e+03 1.12e+03 5.00e+04 ▇▁▁▁▁
maximum_maximum_nights 0 1.00 8.49e+02 6.69e+02 1.00e+00 3.65e+02 1.12e+03 1.12e+03 5.00e+04 ▇▁▁▁▁
minimum_nights_avg_ntm 0 1.00 4.04e+00 2.06e+01 1.00e+00 1.00e+00 2.00e+00 2.00e+00 1.12e+03 ▇▁▁▁▁
maximum_nights_avg_ntm 0 1.00 8.48e+02 6.69e+02 1.00e+00 3.65e+02 1.12e+03 1.12e+03 5.00e+04 ▇▁▁▁▁
availability_30 0 1.00 1.95e+01 1.24e+01 0.00e+00 1.00e+00 2.60e+01 3.00e+01 3.00e+01 ▃▁▁▂▇
availability_60 0 1.00 4.21e+01 2.38e+01 0.00e+00 2.20e+01 5.60e+01 6.00e+01 6.00e+01 ▂▁▁▁▇
availability_90 0 1.00 6.54e+01 3.47e+01 0.00e+00 4.90e+01 8.50e+01 9.00e+01 9.00e+01 ▂▁▁▁▇
availability_365 0 1.00 2.20e+02 1.39e+02 0.00e+00 8.90e+01 2.11e+02 3.61e+02 3.65e+02 ▃▂▃▁▇
number_of_reviews 0 1.00 2.36e+01 4.18e+01 0.00e+00 1.00e+00 6.00e+00 2.80e+01 5.55e+02 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 9.18e+00 1.50e+01 0.00e+00 0.00e+00 2.00e+00 1.20e+01 1.77e+02 ▇▁▁▁▁
review_scores_rating 5616 0.74 9.51e+01 8.64e+00 2.00e+01 9.40e+01 9.70e+01 1.00e+02 1.00e+02 ▁▁▁▁▇
review_scores_accuracy 5632 0.74 9.72e+00 8.50e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_cleanliness 5632 0.74 9.59e+00 9.20e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_checkin 5635 0.74 9.82e+00 7.20e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_communication 5631 0.74 9.78e+00 7.90e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_location 5636 0.74 9.81e+00 6.80e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_value 5637 0.74 9.58e+00 9.00e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
calculated_host_listings_count 0 1.00 6.09e+00 1.60e+01 1.00e+00 1.00e+00 2.00e+00 4.00e+00 1.57e+02 ▇▁▁▁▁
calculated_host_listings_count_entire_homes 0 1.00 4.09e+00 1.55e+01 0.00e+00 0.00e+00 1.00e+00 2.00e+00 1.57e+02 ▇▁▁▁▁
calculated_host_listings_count_private_rooms 0 1.00 1.77e+00 4.20e+00 0.00e+00 0.00e+00 1.00e+00 2.00e+00 4.80e+01 ▇▁▁▁▁
calculated_host_listings_count_shared_rooms 0 1.00 9.00e-02 8.30e-01 0.00e+00 0.00e+00 0.00e+00 0.00e+00 2.00e+01 ▇▁▁▁▁
reviews_per_month 5371 0.75 1.36e+00 1.50e+00 1.00e-02 3.00e-01 8.20e-01 1.92e+00 1.47e+01 ▇▁▁▁▁

We found that:

  • Weekly and monthly price are missing most data points. We will exclude them going forwards and only focus on price.
  • Square feet is missing many data points. This makes the variable essentially unusable. These will be removed in later steps.
  • Host acceptance and response rate are missing a lot of values. This will most likely mean that their listing has been their first listing. We should be careful how to treat this variable going forwards, the missing values might need to be excluded.
  • Similarly the review scores are missing for many listings. Most likely these are now listings and have not been reviewed. We need to be careful how to treat them going forwards.
  • Bed, bedrooms and bathrooms have a small amount of missing data points. We can’t safely assume these to just be 0 by nature so we would not alter them for now.

Uniqueness

  • The most important variable to look at here is lisiting url. We only have unique values here which is important so that we do not double count any properties
  • Interestingly we don’t have only unique names for the listings, nor are all descriptions unique. Even further, many image urls are duplicates - many hosts in Mexico City seem to make their lives easy by copying other listings information to their own! We don’t believe this is enough evidence to exclude any listings however.

Maximum and minimum nights

What are the most common values for the variable minimum_nights?

count_minimum_nights <- listings4 %>% 

# Count number of properties for each level of minimum nights
  group_by(minimum_nights) %>% 
  summarise(count = n()) %>%

# Arrange in descending order
    arrange(desc(count))

# Show list
count_minimum_nights

minimum_nightscount
1       10088
2       6474
3       2376
5       620
7       534
4       490
30       313
6       194
15       160
10       79
28       70
14       64
20       51
8       36
90       34
60       30
180       28
12       26
365       22
25       21
29       12
9       11
21       9
120       9
13       6
150       5
18       4
31       4
100       4
360       4
300       3
16       2
17       2
19       2
24       2
33       2
40       2
45       2
160       2
182       2
1e+03       2
11       1
22       1
26       1
35       1
39       1
54       1
55       1
61       1
63       1
80       1
89       1
92       1
93       1
99       1
132       1
179       1
183       1
350       1
363       1
400       1
500       1
1.12e+031
1.12e+031
Most popular minimum nights are: 1 night, 2 nights, 3 nights, 5 nights, 7 nights (descending order)

179 days upwards: These properties most likely use Airbnb as if it was an alternative for Zoopla (as a marketing medium to advertise their properties to potential long-term tenants)

Is there any value among the common values that stands out?

30 days (this type of accommodation is not for short-term rents)

What is the likely intended purpose for Airbnb listings with this seemingly unusual value for minimum_nights?

30-180 days: for long-term tourists or long-term business visitors. 179 days upwards: These properties most likely use Airbnb as if it was an alternative for Zoopla (as an online marketing medium to advertise their properties to potential long-term tenants)

Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels. We only want to include listings in our regression analysis that are intended for travel purposes. When looking at the distribution of the minimum nights across listings we see that it is heavily left skewed. The majority of listings is as expected for short term renting, however there are some big outliers with over 100 days minimum nights stay. We will filter the data set to only include listings with a minimum night stay of <= 4. In addition, we will use our understanding of missing values in earlier parts to remove weekly prices, monthly prices, and square feet area.

# Filter properties with 4 minimum nights or less
listings5 <- listings4 %>%
  filter(minimum_nights <= 4) %>%

## Deselect unwanted variables
  select(-weekly_price, -monthly_price, -square_feet)

Informative Visualisations

Our prior analyses have highlighted a list of variables that could be interesting for further analysis. In this section we will look at them in more detail

  • Price
  • Cleaning Fee
  • Security Deposit
  • Bed
  • Bedrooms
  • Review Scores (review_scores_rating is the total score)
  • Property Types
  • Accommodates
  • Super Hosts, Host Response & Acceptance Rate (host_acceptance_rate host_listings_count )

Host Data

Histograms of host response and acceptance rates

# Investigate distribution of Host responses
listings5 %>% 
  ggplot() +
  geom_histogram(aes(x= host_response_rate)) +
  labs(title = "Most Airbnb hosts have 100% response rate",
       subtitle = "Hosts with 0% Response Rate are potentially fake listings",
       x = "Host Response Rate") +
  theme_economist()

# Investigate distribution of Host Acceptance Rates
listings5 %>% 
  ggplot() +
  geom_histogram(aes(x= host_acceptance_rate), binwidth = 0.05) +
  labs(title = "Most Airbnb hosts have over 90% acceptance rate",
       subtitle = "Hosts with 0% Acceptance Rate are potentially fake listings",
       x = "Host Acceptance Rate") +
  theme_economist()

Due to the competitiveness and strict hosting policies of Airbnb, most hosts tend to have very high acceptance and response rates.

Filtering Improper Listings

listings5 %>% 
  
  # Filter out hosts who never responded or accepted any listings
  filter(host_response_rate == 0, 
         host_acceptance_rate == 0) %>% 
  summarise(number_of_potentially_fake_listings = n())
number_of_potentially_fake_listings
40
# Count the number of hosts to be excluded
listings5 %>% 
  filter(host_acceptance_rate <= 0) %>% 
  summarise(to_be_removed = n())
to_be_removed
633

We can see that a substantial amount of hosts have never responded to requests and/or approved a request. We can be sure that hosts that don’t either respond to or accept any request are likely to be fake accounts. We can count 40 of these cases.

Extending our reasoning into acceptance rate and response rate individually, we believe hosts that don’t respond to requests but accept them might just be lazy, so we should not exclude them. However, hosts that don’t accept any request are potentially just using AirBnB to test the market and see whether there is any interest in the flat. Alternatively, they might have inflated price expectations that no one is willing to pay, or they are simply very new listings. Either way, we decided to exclude these 629 properties because they most likely are not good representations of the Airbnb market. This group already encompasses the aforementioned 40 listings that neither respond or accept requests, so we can just filter out any listings with 0 acceptance rate.

# Exclude all listings with hosts that do not respond or have never accepted 
listings6 <- listings5 %>% 
  filter(host_acceptance_rate > 0)
# Calculate percentage of superhosts
listings6 %>% 
  mutate(superhosts = ifelse(host_is_superhost == TRUE,1,0)) %>% 
  summarise(hosts = n(),
            superhosts = sum(superhosts)) %>% 
  summarise(perc_superhosts = superhosts/hosts)
perc_superhosts
0.433

We also found the percentage of superhosts to be at 43.8% - a healthy percentage of all hosts in Mexico City therefore regularly rents out their properties. Going forwards it will be interesting to see whether this has an effect on any other variables - especially price.

Price

Outliers Included

With extreme outliers included, prices are heavily right skewed. Thus, boxplots of listings price are not visually comprehensible. Instead, boxplots of log prices will be plotted.

# Create simple density plot
listings6 %>%
  ggplot() +
  geom_histogram( aes(x = price), binwidth = 100)  +
  labs(title = "Price histogram heavily right-skewed",
       subtitle = "Excessive price outliers should be investigated",
       x = "Price per night") +
  theme_economist() +
  
  # Add dollar signs
  scale_x_continuous(labels = scales::dollar_format())

# look for outliers - create boxplot on log of price data
listings6 %>%
  mutate(log_price = log(price)) %>%
  group_by(prop_type_simplified) %>% 
  ggplot() +
  geom_boxplot(aes(x = reorder(prop_type_simplified, log_price, FUN = median), y = log_price)) +
  theme(axis.title.x = element_blank()) +
  labs(title = " Exponential price distribution shows that Houses are \n priced significantly lower than other property types",
       subtitle = " Excessive price outliers should be excluded",
       y = "Price - Log-Scale",
       x = "Property type") +
  theme_economist()

We saw a very skewed distribution of price. Outliers that charge for example over USD7,400 per night (or exp(8.91)) for a property do not seem to be legitimite properties and were therefore immediately excluded.

Extreme Outliers EXcluded

Filtering extreme outliers allow for a boxplot of non-log listing prices to be plotted. Here, we choose to filter out extreme outliers, which we defined as listings outside the 1.5 x IQR of log prices. The reason for using log-price outliers instead of price outliers is that this will exclude only the most unrealistically priced accommodations, and not those that are just quite fancy and luxurious.

# Find outliers that are to be excluded
exp(boxplot.stats(log(listings6$price))$out)
##   [1]   9036  11275  10036   8830  33961   8438  10938   7924  10370  15849
##  [11]   8232   9021  18113  20509  24821  10188  31697  23162  15011  11280
##  [21]  10188   9057   7924   8908  12453  20382  11276  13154   8151   7994
##  [31]   8842  10802  21854   9472  10665      0   9059     45     45  14041
##  [41]   7926  10008  12592   9059      0      0   7994  16981   7926  13520
##  [51]  17891  17891  19386   8266  22624   7926  11323  10395  12113  11320
##  [61]  11867  25811   7992  15143   7924  11323   9001  10007   7893  26900
##  [71]  99994  16256   7924   8447   8536  10007  14856  13588  11210   9059
##  [81]   9059   7976   8264  19994  10007   8490  31682  66379   9172  13673
##  [91]   8989   7926  18113   7992   8832  29999   8991   7926   9001  13996
## [101]   9059  10191  11320   9918  31946  10352  10191   7926   9917  12453
## [111]   9901  12376  10259  11323 226409   9782  13673  13585  29624  67940
## [121]  11321  15952  20379   8490  13171  13673  13157  49996     23  22641
## [131]  26730   7924   7893  34505  13588   7926   8198  11751     45  16098
## [141]  15008   9991     46  99996  35026  13588  13585  56617     45  18815
## [151]   9011      0      0  51068      0  12001   7926  19405  15849   8991
## [161]  13673   9115  15850  24022  33554  33954  33554  33676  33676  33761
## [171]  33676  33676  33676  33557  33557  33676  33676  33554  33954  33554
## [181]  33554  33554  33554  33676  33554  33954  33954  33557  33557  33676
## [191]  33676  33557  33676  33557  33676  33676  33554  33554  33676  33557
## [201]  33557  33557  33557  33554  33554  33954  33554  33954  33557  33954
## [211]  33527  33954  33557  33527  33527  33557  33557  33761  33554  33676
## [221]  33676  33557  33557  33954  33557  33557  33954  33554  33554  33554
## [231]  33954  33954  33557  33954  33954  33557  33557  33954  33557  33557
## [241]  33954  33954  33557  33557  33557  33557  33954  33557  33954  33554
## [251]  33954  33557  33557  33954  33527  33527  33527  33527  33954  33557
## [261]  33557  33554  33954  33557  33954  33954  33557  33557  33557  33557
## [271]  33557  33954  33554  33554  33954  33554  33554  33557  33557  33557
## [281]  33557  33954  33954  33554  33554  33676  33676  33761  33676  33676
## [291]  33676  33554  33954  33557  33557  33527  33527  33527  33527  33676
## [301]  33676  33557  33954  33527  33527  33761  33554  33761  33954  33954
## [311]  33557  33554  33554  33554  33554  33557  33557  33554  33554  33676
## [321]  33676   8309   9489   7994   7994
# exclude log(price outliers)
listings7 <- listings6 %>% 
  filter(!log(price) %in% boxplot.stats(log(price))$out)

# Check new distribution
listings7 %>%
  ggplot() +
  geom_boxplot(aes(x = reorder(prop_type_simplified, price, FUN = median), y=price)) +
  labs(subtitle = "Price Distribution by Property Type",
       title = "Price outliers within reasonable range",
       y = "Price per night",
       x = "Property type") +
  theme(axis.title.x = element_blank())+
  theme_economist() +
  
  # Add dollar signs
  scale_y_continuous(labels = scales::dollar_format())

# Check new distribution
listings7 %>%
  ggplot() +
  geom_density(aes(x=price)) +
  
  # Facet by property type
  facet_wrap(~prop_type_simplified, scale = "free_x") +
  
  labs(subtitle = "Price Distribution by Property Type",
       title = "Price outliers within reasonable range",
       x = "Price per night (USD)") +
  theme(axis.title.x = element_blank())+
  theme_economist() 

Our new distribution of price is more visually comprehensible, however when looking at density plots we still find a heavy right-skewness, which should be taken into account going forwards, especially as AirBnB’s maximum price filter is USD1000+

Review Scores

# CLook at distribution to ratings
listings7 %>%
  
  # Change facet name of superhost (TRUE = superhost, FALSE, = not superhost)
  mutate(host_is_superhost = if_else(host_is_superhost, "Superhost", "Not Superhost")) %>%
  
  # Plot histogram
  ggplot() +
  geom_histogram( aes(x = review_scores_rating), binwidth = 1)  +
  labs(subtitle = "Irregularities at ratings of 20, 40, 60, and 80 may be due to 
       new listings with insufficient data, or AirBnB rounding algorithms for lower values",
       title = "Ratings are left skewed - more so for superhosts") +
  facet_wrap(~host_is_superhost) + 
  theme_economist()

# Investigate relation of Rating to Price
listings7 %>%
  ggplot() +
  geom_point( aes(x = review_scores_rating, y = price))  +
  geom_smooth(aes(x = review_scores_rating, y = price), method='lm') +
  labs(subtitle = "Irregularities at ratings of 20,and 40 may be due to 
       new listings with insufficient data, or AirBnB rounding algorithms for lower values",
       title = "Well rated properties seem to charge more on average \n - but driven by hugely expensive properties") +
  theme_economist()

Looking at review scores we saw that the ratings are left heavily skewed. This could likely be explained by the reciprocal rating mechanism of AirBnB, where hosts also rate the tenants and the tenants’ score impacts their likelihood of them being accepted to a new booking. Going forwards it could be considered to disregard low rated properties (e.g. <5) as these will have had substantial problems and might be outliers. A normal traveller wouldn’t consider properties with such low ratings.

That being said we saw a slight over-performance of superhosts compared to normal hosts. We also saw a tendency of better-rated properties to charge a higher price. However this seems to be especially driven by expensive properties.

Cleaning Fee

# Look at distribution of cleaning fees
listings7 %>%
  ggplot() +
  geom_histogram( aes(x = cleaning_fee))  +
  labs(subtitle = "These may be for more expensive properties",
       title = "Some extraordinary cleaning fees are charged",
       x = "Cleaning fees") +
  theme_economist() +
  
  # Add dollar sign
  scale_x_continuous(labels = scales::dollar_format())

# Relate Cleaning fee to Price of minimum stay
listings7 %>%
  filter(cleaning_fee >= 0) %>% 
  mutate(clean_to_price_min = cleaning_fee / (price * minimum_nights)) %>% 
  ggplot() +
  geom_boxplot( aes(y = clean_to_price_min))  +
  labs(title = "Cleaning Fee to Price Ratio",
       subtitle = "In relation to price some cleaning fees are non-sensical and should be excluded") +
  theme_economist()

# exclude clean to price outliers
listings8 <- listings7 %>% 
  mutate(clean_to_price_min = cleaning_fee / (price * minimum_nights)) %>% 
  filter(!clean_to_price_min %in% boxplot.stats(clean_to_price_min)$out)

# Find outliers that are being excluded
min(exp(boxplot.stats(listings7$cleaning_fee / (listings7$price * listings7$minimum_nights))$out))
## [1] 1.97
# Investigate relation of Rating to Price
listings8 %>%
  ggplot() +
  geom_point( aes(x = cleaning_fee, y = price))  +
  geom_smooth(aes(x = cleaning_fee, y = price), method='lm') +
  labs(subtitle = "Cleaning Fee vs Price",
       title = "If cleaning fees are charged they increase in relation to price") +
  theme_economist()

We initially looked at the distribution of cleaning fees and found some extraordinarily high fees being charged on some properties. Realising that a mansion costing USD1000+ per night might also be very expensive to clean, we decided to look at a different proxy for the logic behind our cleaning fee numbers - the cleaning fee to price per minimum stay ratio.

Plotting the boxplot of this ratio we still find many outliers that charge more than 2x the minimum stay price as a cleaning fee, which is not reasonable for a short-term holiday stay. This pricing strategy might be used to trick tenants to look at more expensive properties by making the price per night price artificially low. Whereas this might be legitimate it will significantly skew our findings as we can no longer understand whether part of a cleaning fee is actually the price per night. We therefore excluded all outliers of over 1.97x the cleaning fee to minimum stay price ratio.

Accommodates (House Size)

## Create histogram to understand distribution of house size
listings8 %>% 
  ggplot()+
  geom_histogram(aes(x= accommodates), binwidth = 1)  +
  labs(subtitle = "Accommodates (House Size) Histogram",
       title = "Most properties accommodate less than 5 guests") +
  theme_economist()

## Create scatterplot of house size and price
listings8 %>% 
  ggplot()+
  geom_point(aes(x= accommodates, y = price)) + 
  
  # add smoothing line
  geom_smooth(aes(x= accommodates, y = price), method = "lm") +
  
  # add titles
  labs(subtitle = "House Size vs Price",
       title = "Propoerties that can accommodate more guests \ntend to be more expensive",
       y = "Daily prices (USD)") +
  theme_economist()

Looking at the accommodation variable we can see that the majority of properties in Mexico City accommodate less than 5 guests. We also find that accommodation, as a proxy for house size, is positively correlated to price.

Correlation Analysis

# Look at distribution to ratings
listings8 %>%
  
  summarise(log(price), review_scores_rating, accommodates, security_deposit, host_is_superhost) %>% 
  ggpairs(aes(colour = host_is_superhost, alpha = 0.4), 
          title = "Light blue represents superhosts, red represents normal hosts")

# ggsave to resize image
ggsave("combined_graphs.jpg", plot = last_plot(), width = 34, height = 18, units ="cm", dpi = 100) 

knitr::include_graphics(here::here("content/projects/project4", "combined_graphs.jpg"), error = FALSE)

The graphic above shows us a lot about our data.

The intuitive conclusions:

  • Price and cleaning fee are positively correlated. This is intuitive as more expensive properties will be more expensive to clean
  • Price and house size (accommodates) are strongly and positively correlated. This is obvious too, as bigger houses most likely cost more to invest
  • Review ratings and price are also strongly and positively correlated. The coefficient may be small, but remember that ratings are on a scale of 100, so an increase of 5 rating points make a considerable difference!
  • The higher the prices, the higher the security deposit.
  • Superhosts on average charge higher prices than normal hosts.

The less expected information:

  • House size is negatively correlated with ratings. This may be because those properties that can accommodate more people are budget properties where quantity can replace quality.

Mapping

# create colour palette for property type
factpal <- colorFactor(topo.colors(5), listings8$property_type)

# create colour palette for price
percentile_price <- listings8 %>% 
  summarise(percentile_price = (price - min(price))/ (max(price) - min(price)))

# add dollar signs for price
listings8 <- listings8 %>% 
  mutate(content = paste(property_type, "for $",(price), sep = " "))

# Create map
leaflet(listings8) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   color = ~factpal(property_type), 
                   popup = ~listing_url,
                   label = ~content)

Regression Analysis

For the target variable , we will use the cost for two people to stay at an Airbnb location for four (4) nights.

First, we check the data to see if there is any property that will only rent for less than 4 nights (maximum_nights < 4). This may be unrealistic, but it is possible that some hosts have a preference for short stays only, or that these listings are fake. Either way, if any property has a maximum nights of below 4, they need to be excluded.

count_maximum_nights <- listings8 %>% 

# Count number of properties for each level of minimum nights
  group_by(maximum_nights) %>% 
  summarise(count = n()) %>%

# Arrange in descending order
    arrange(maximum_nights)

# Show list
count_maximum_nights
maximum_nightscount
1       21
2       31
3       56
4       48
5       95
6       49
7       254
8       61
9       7
10       169
11       9
12       27
13       2
14       79
15       328
16       13
17       9
18       8
19       2
20       116
21       36
22       4
23       4
24       6
25       25
26       2
27       8
28       174
29       47
30       1268
31       241
32       38
33       6
34       2
35       46
36       9
37       2
38       4
39       1
40       48
41       2
42       3
43       1
44       1
45       79
46       3
48       1
49       2
50       30
51       5
53       1
55       4
56       1
57       1
59       2
60       463
61       6
62       20
63       2
64       2
65       15
66       2
67       1
69       1
70       6
71       1
75       4
80       10
81       1
84       2
89       3
90       540
91       9
92       6
93       59
94       1
95       8
98       1
99       7
100       114
101       2
102       2
104       1
107       1
110       1
111       1
113       1
119       2
120       116
124       3
125       2
130       5
135       3
140       2
150       29
152       1
154       1
160       3
168       1
169       1
175       1
179       1
180       208
182       15
185       35
186       5
187       3
188       1
189       2
190       5
200       32
208       1
210       10
221       1
224       1
238       1
240       9
250       4
260       3
269       1
270       3
279       1
300       53
301       1
320       2
322       3
325       1
330       1
333       1
340       2
350       8
355       1
358       4
359       1
360       89
361       1
362       1
363       1
364       14
365       522
366       10
369       2
370       2
375       1
376       1
380       1
383       1
390       1
398       2
400       16
444       1
450       1
470       1
500       21
600       10
700       7
730       11
800       1
900       3
902       1
999       25
1e+03       50
1e+03       1
1e+03       1
1.08e+031
1.1e+03 1
1.1e+03 1
1.1e+03 3
1.11e+034
1.12e+032
1.12e+034
1.12e+032
1.12e+031
1.12e+033
1.12e+0360
1.12e+038498
1.13e+031
1.13e+031
2e+03       1
3e+03       1
5e+03       2
1e+04       2
5e+04       2

We can see from the table above that there are over 100 accommodations with the maximum nights below 4. These will be excluded.

Next, we create a new variable called price_4_nights that uses price, cleaning_fee, accommodates, guests_included, and extra_people to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is will be our dependent variable.

Note: we first worked with the definition that extra_people is the charge for having more than 1 person. However, after further research, we have realised that this is INCORRECT. Extra_people fee is, in fact, the per person charge for every additional guest exceeding those included in the original prices, as long as the number of guests is not higher than the maximum number that the property accommodates. We will perform our analysis using this new definition.

listings9 <- listings8 %>% 
  
  # Filter out those with maximum nights below 4 or accommodates less than 2 as we cannot rent there
  filter(maximum_nights >= 4,
         accommodates >= 2) %>%
  
  # Create new variable
  mutate(price_4_nights = if_else(guests_included >= 2,
                                  price * 4 + cleaning_fee, # Price if original rates already include 2 people or above 
                                  price * 4 + cleaning_fee + extra_people)) # Price if original rates only include 1 person

Use histograms or density plots to examine the distributions of price_4_nights and log(price_4_nights). Which variable should we use for the regression model? Why?

# Create simple density plot
listings9 %>%
  ggplot() +
  geom_density( aes(x = price_4_nights), binwidth = 100)  +
  labs(title = "Density plot of prices for 4 nights heavily right-skewed",
       subtitle = "There is a strict zero lower bound to prices, but no strict upper bound",
       x = "Total price of accommodation for 4 nights (USD)") +
  theme_economist()

# Create logged density plot
listings9 %>%
  ggplot() +
  geom_density( aes(x = log(price_4_nights) ), binwidth = 100)  +
  labs(subtitle = "Log 4-nights prices is closer to normal distribution",
       title = "No obvious skewed patterns",
      x = "Log of total price of accommodation for 4 nights") +
  theme_economist()

It looks like we should use the log-transformed price_4_nights variable as it is significantly less skewed and closer to normal distribution.

Review scores rating, number of reviews, and property type

Our first regression model called model1 will start with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.

# Create regression
model1 <- lm(log(price_4_nights) ~ prop_type_simplified + number_of_reviews + review_scores_rating, data = listings9)

summary(model1)
## 
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
##     review_scores_rating, data = listings9)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7723 -0.5000 -0.0178  0.4460  2.5209 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      7.828347   0.089266   87.70  < 2e-16 ***
## prop_type_simplifiedCondominium  0.015047   0.025168    0.60  0.54995    
## prop_type_simplifiedHouse       -0.414243   0.020150  -20.56  < 2e-16 ***
## prop_type_simplifiedLoft         0.002488   0.026191    0.10  0.92431    
## prop_type_simplifiedOther       -0.110432   0.023618   -4.68    3e-06 ***
## number_of_reviews                0.000467   0.000133    3.51  0.00044 ***
## review_scores_rating             0.004775   0.000932    5.12    3e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.694 on 11335 degrees of freedom
##   (1572 observations deleted due to missingness)
## Multiple R-squared:  0.0428, Adjusted R-squared:  0.0423 
## F-statistic: 84.6 on 6 and 11335 DF,  p-value: <2e-16
autoplot(model1) + theme_bw()

# Check for collinearity
vif(model1)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.03  4            1.00
## number_of_reviews    1.03  1            1.01
## review_scores_rating 1.01  1            1.00
# Summarise model estimates in dataframe
model1_estimates <- summary(model1)$coefficients[,1] 
model1_estimates <- exp(model1_estimates)
model1_estimates <- stack(model1_estimates)
model1_estimates <- model1_estimates %>% 
  select(ind, values)
model1_estimates
indvalues
(Intercept)2.51e+03
prop_type_simplifiedCondominium1.02    
prop_type_simplifiedHouse0.661   
prop_type_simplifiedLoft1       
prop_type_simplifiedOther0.895   
number_of_reviews1       
review_scores_rating1       

We don’t find any high levels of collinearity in terms of VIF. We therefore don’t have to exclude any independent variables.

Interpreting the coefficient review_scores_rating in terms of price_4_nights.

The effect of review score ratings on 4-nights rents is of strong statistical significance (it is non-zero even at 99.99% significance level). Ceteris paribus, an increase in review score ratings of 1 is expected to induce a 0.48% increase in 4-nights prices.

Interpreting the coefficient of prop_type_simplified in terms of price_4_nights.

Looking at our property types coefficients and p-values, we can find that changing from renting Apartments to renting ‘House’ and ‘Other’ induce effects of strong statistical significance (at least at 99% level of significance) on 4-nights rents:

  • Ceteris paribus, changing from renting Apartments to renting House is expected to lead to a 41.4% decrease in 4-nights prices.

  • Ceteris paribus, changing from renting Apartments to renting ‘Other’ is expected to lead to a 11% decrease in 4-nights prices.

Meanwhile, the effect of renting Lofts or Condominium is not statistically different from that of renting Apartments.

Adding Room Type

We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. Fit a regression model called model2 that includes all of the explanatory variables in model1 plus room_type.

# Create regression
model2 <- lm(log(price_4_nights) ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type, 
             data = listings9)

summary(model2)
## 
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
##     review_scores_rating + room_type, data = listings9)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.979 -0.376 -0.043  0.336  2.668 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      8.137277   0.073236  111.11  < 2e-16 ***
## prop_type_simplifiedCondominium  0.037686   0.020610    1.83  0.06750 .  
## prop_type_simplifiedHouse        0.073227   0.017757    4.12  3.8e-05 ***
## prop_type_simplifiedLoft        -0.142949   0.021538   -6.64  3.3e-11 ***
## prop_type_simplifiedOther        0.089832   0.021120    4.25  2.1e-05 ***
## number_of_reviews               -0.000378   0.000109   -3.45  0.00056 ***
## review_scores_rating             0.004462   0.000763    5.85  5.2e-09 ***
## room_typeHotel room             -0.092666   0.051673   -1.79  0.07295 .  
## room_typePrivate room           -0.918969   0.012519  -73.41  < 2e-16 ***
## room_typeShared room            -1.214563   0.066764  -18.19  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.568 on 11332 degrees of freedom
##   (1572 observations deleted due to missingness)
## Multiple R-squared:  0.359,  Adjusted R-squared:  0.358 
## F-statistic:  704 on 9 and 11332 DF,  p-value: <2e-16
autoplot(model2) + theme_bw()

# Check for collinearity
vif(model2)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.43  4            1.05
## number_of_reviews    1.04  1            1.02
## review_scores_rating 1.01  1            1.00
## room_type            1.43  3            1.06
# Summarise model estimates in dataframe
model2_estimates <- summary(model2)$coefficients[,1] 
model2_estimates <- exp(model2_estimates)
model2_estimates <- stack(model2_estimates)
model2_estimates <- model2_estimates %>% 
  select(ind, values)
model2_estimates

indvalues
(Intercept)3.42e+03
prop_type_simplifiedCondominium1.04    
prop_type_simplifiedHouse1.08    
prop_type_simplifiedLoft0.867   
prop_type_simplifiedOther1.09    
number_of_reviews1       
review_scores_rating1       
room_typeHotel room0.911   
room_typePrivate room0.399   
room_typeShared room0.297   
We don’t find any high levels of collinearity in terms of VIF. We therefore don’t have to exclude any independent variables.

We can see that room type is a statistically significant determinant of 4-night prices:

  • Moving from renting entire house/apartment to renting private rooms is expected to induce a price decrease of 91.9%

  • Moving from renting entire house/apartment to renting shared rooms is expected to induce a price decrease of 121.5%. However, this is an unrealistic assumption, as it means that we can expect to receive money if we were to rent shared rooms. Hence, we can only interpret from the data that renting a shared room will induce a percentage fall in 4-night prices even greater than that of renting private rooms.

Further variables/questions to explore

Our dataset has many more variables, so here we are going to extend our analysis to further explore the data.

Are the number of bathrooms, bedrooms, beds, or size of the house (accommodates) significant predictors of price_4_nights?

As a standalone model

# Create regression
model3 <- lm(log(price_4_nights) ~ beds + bathrooms + accommodates, 
             data = listings9)

summary(model3)
## 
## Call:
## lm(formula = log(price_4_nights) ~ beds + bathrooms + accommodates, 
##     data = listings9)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.947 -0.426  0.000  0.427  2.223 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.64672    0.01201   636.6   <2e-16 ***
## beds         -0.07745    0.00522   -14.8   <2e-16 ***
## bathrooms     0.10910    0.00700    15.6   <2e-16 ***
## accommodates  0.16901    0.00396    42.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.638 on 12811 degrees of freedom
##   (99 observations deleted due to missingness)
## Multiple R-squared:  0.209,  Adjusted R-squared:  0.209 
## F-statistic: 1.13e+03 on 3 and 12811 DF,  p-value: <2e-16
autoplot(model3) + theme_bw()

# Check for collinearity
vif(model3)
##         beds    bathrooms accommodates 
##         2.73         1.55         2.28
# Summarise model estimates in dataframe
model3_estimates <- summary(model3)$coefficients[,1] 
model3_estimates <- exp(model3_estimates)
model3_estimates <- stack(model3_estimates)
model3_estimates <- model3_estimates %>% 
  select(ind, values)

model3_estimates

indvalues
(Intercept)2.09e+03
beds0.925   
bathrooms1.12    
accommodates1.18    
As expected, the VIFs are higher here because these variables are more correlated than those in previous regressions. However, they are still not high enough to cause significant problems, so we therefore don’t have to exclude any independent variables.

The number of beds, bathrooms, and the maximum number of people accommodated by a property have statistically significant effects on prices. From the data, adding 1 bathroom is expected to increase 4-night prices by 10.9% and expanding the maximum number of people accommodated by 1 is expected to increase 4-night prices by 16.9%. However, from the data it seems that by adding 1 additional bed, we would expect 4-night prices to fall by 7.6%. This seems illogical. However, it may be because houses that register more beds are those with many temporary, fold-able beds or sofa beds. This type of setting may make the property looks messy, so it only suits cheaper budget accommodations where the presentation of the property is of lesser importance.

As part of extended model

# Create regression
model4 <- lm(log(price_4_nights) ~ beds + bathrooms + accommodates + prop_type_simplified + number_of_reviews + review_scores_rating + room_type, 
             data = listings9)

summary(model4)
## 
## Call:
## lm(formula = log(price_4_nights) ~ beds + bathrooms + accommodates + 
##     prop_type_simplified + number_of_reviews + review_scores_rating + 
##     room_type, data = listings9)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.954 -0.347 -0.026  0.318  2.582 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      7.589522   0.070477  107.69  < 2e-16 ***
## beds                            -0.059573   0.004866  -12.24  < 2e-16 ***
## bathrooms                        0.128271   0.006729   19.06  < 2e-16 ***
## accommodates                     0.103888   0.004034   25.75  < 2e-16 ***
## prop_type_simplifiedCondominium  0.016324   0.019315    0.85   0.3980    
## prop_type_simplifiedHouse       -0.016402   0.016813   -0.98   0.3293    
## prop_type_simplifiedLoft        -0.017101   0.020425   -0.84   0.4025    
## prop_type_simplifiedOther        0.084507   0.019874    4.25  2.1e-05 ***
## number_of_reviews               -0.000284   0.000103   -2.77   0.0057 ** 
## review_scores_rating             0.004974   0.000720    6.91  5.1e-12 ***
## room_typeHotel room              0.000705   0.048484    0.01   0.9884    
## room_typePrivate room           -0.724805   0.012969  -55.89  < 2e-16 ***
## room_typeShared room            -1.219237   0.063750  -19.13  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.531 on 11280 degrees of freedom
##   (1621 observations deleted due to missingness)
## Multiple R-squared:  0.44,   Adjusted R-squared:  0.439 
## F-statistic:  738 on 12 and 11280 DF,  p-value: <2e-16
autoplot(model4) + theme_bw()

# Check for collinearity
vif(model4)
##                      GVIF Df GVIF^(1/(2*Df))
## beds                 2.88  1            1.70
## bathrooms            1.54  1            1.24
## accommodates         2.83  1            1.68
## prop_type_simplified 1.52  4            1.05
## number_of_reviews    1.04  1            1.02
## review_scores_rating 1.01  1            1.00
## room_type            1.84  3            1.11
# Summarise model estimates in dataframe
model4_estimates <- summary(model4)$coefficients[,1] 
model4_estimates <- exp(model4_estimates)
model4_estimates <- stack(model4_estimates)
model4_estimates <- model4_estimates %>% 
  select(ind, values)

model4_estimates

indvalues
(Intercept)1.98e+03
beds0.942   
bathrooms1.14    
accommodates1.11    
prop_type_simplifiedCondominium1.02    
prop_type_simplifiedHouse0.984   
prop_type_simplifiedLoft0.983   
prop_type_simplifiedOther1.09    
number_of_reviews1       
review_scores_rating1       
room_typeHotel room1       
room_typePrivate room0.484   
room_typeShared room0.295   
Similarly, the VIFs here also higher. However, they are still not high enough to cause significant problems, so we therefore don’t have to exclude any independent variables.

Again, the number of bathrooms and the maximum number of people accommodated by the listing are positively correlated to 4-nights prices, while beds quantity is negatively correlated. However, what is more interesting here is that adding these 3 variables render the effect of most property types statistically insignificant.

Do superhosts (host_is_superhost) command a pricing premium, after controlling for other variables?

# Create regression
model5 <- lm(log(price_4_nights) ~ prop_type_simplified + beds + bathrooms + accommodates + number_of_reviews + review_scores_rating + room_type + host_is_superhost, 
             data = listings9)

summary(model5)
## 
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + beds + 
##     bathrooms + accommodates + number_of_reviews + review_scores_rating + 
##     room_type + host_is_superhost, data = listings9)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.901 -0.346 -0.024  0.315  2.628 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      7.721884   0.071928  107.36  < 2e-16 ***
## prop_type_simplifiedCondominium  0.016660   0.019253    0.87     0.39    
## prop_type_simplifiedHouse       -0.020564   0.016767   -1.23     0.22    
## prop_type_simplifiedLoft        -0.010518   0.020374   -0.52     0.61    
## prop_type_simplifiedOther        0.088276   0.019816    4.45  8.5e-06 ***
## beds                            -0.059010   0.004851  -12.16  < 2e-16 ***
## bathrooms                        0.127685   0.006707   19.04  < 2e-16 ***
## accommodates                     0.103890   0.004021   25.84  < 2e-16 ***
## number_of_reviews               -0.000455   0.000104   -4.37  1.3e-05 ***
## review_scores_rating             0.003145   0.000749    4.20  2.7e-05 ***
## room_typeHotel room              0.006774   0.048334    0.14     0.89    
## room_typePrivate room           -0.716900   0.012960  -55.32  < 2e-16 ***
## room_typeShared room            -1.211779   0.063552  -19.07  < 2e-16 ***
## host_is_superhostTRUE            0.091631   0.010686    8.57  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.53 on 11279 degrees of freedom
##   (1621 observations deleted due to missingness)
## Multiple R-squared:  0.443,  Adjusted R-squared:  0.443 
## F-statistic:  691 on 13 and 11279 DF,  p-value: <2e-16
autoplot(model5) + theme_bw()

# Check for collinearity
vif(model5)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.52  4            1.05
## beds                 2.88  1            1.70
## bathrooms            1.54  1            1.24
## accommodates         2.83  1            1.68
## number_of_reviews    1.08  1            1.04
## review_scores_rating 1.10  1            1.05
## room_type            1.85  3            1.11
## host_is_superhost    1.15  1            1.07
# Summarise model estimates in dataframe
model5_estimates <- summary(model5)$coefficients[,1] 
model5_estimates <- exp(model5_estimates)
model5_estimates <- stack(model5_estimates)
model5_estimates <- model5_estimates %>% 
  select(ind, values)

model5_estimates

indvalues
(Intercept)2.26e+03
prop_type_simplifiedCondominium1.02    
prop_type_simplifiedHouse0.98    
prop_type_simplifiedLoft0.99    
prop_type_simplifiedOther1.09    
beds0.943   
bathrooms1.14    
accommodates1.11    
number_of_reviews1       
review_scores_rating1       
room_typeHotel room1.01    
room_typePrivate room0.488   
room_typeShared room0.298   
host_is_superhostTRUE1.1     
We don’t find any high levels of collinearity in terms of VIF. We therefore don’t have to exclude any independent variables.

The effect of being a superhost on 4-nights prices is statistically significant: Ceteris paribus, staying with a superhost is expected to lead to an increase in 4-nights rents of 9.2%.

Most owners advertise the exact location of their listing (is_location_exact == TRUE), while a non-trivial proportion don’t.

# Create regression
model6 <- lm(log(price_4_nights) ~ prop_type_simplified + beds + bathrooms + accommodates + number_of_reviews + review_scores_rating + room_type + host_is_superhost + is_location_exact, 
             data = listings9)

summary(model6)
## 
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + beds + 
##     bathrooms + accommodates + number_of_reviews + review_scores_rating + 
##     room_type + host_is_superhost + is_location_exact, data = listings9)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.911 -0.346 -0.025  0.312  2.610 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      7.645062   0.072699  105.16  < 2e-16 ***
## prop_type_simplifiedCondominium  0.016723   0.019215    0.87     0.38    
## prop_type_simplifiedHouse       -0.016839   0.016743   -1.01     0.31    
## prop_type_simplifiedLoft        -0.012194   0.020336   -0.60     0.55    
## prop_type_simplifiedOther        0.089465   0.019778    4.52  6.1e-06 ***
## beds                            -0.058622   0.004842  -12.11  < 2e-16 ***
## bathrooms                        0.127357   0.006695   19.02  < 2e-16 ***
## accommodates                     0.103462   0.004014   25.78  < 2e-16 ***
## number_of_reviews               -0.000438   0.000104   -4.22  2.5e-05 ***
## review_scores_rating             0.003091   0.000747    4.14  3.5e-05 ***
## room_typeHotel room              0.004840   0.048241    0.10     0.92    
## room_typePrivate room           -0.715511   0.012937  -55.31  < 2e-16 ***
## room_typeShared room            -1.217241   0.063434  -19.19  < 2e-16 ***
## host_is_superhostTRUE            0.090513   0.010667    8.49  < 2e-16 ***
## is_location_exactTRUE            0.095795   0.014306    6.70  2.2e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.528 on 11278 degrees of freedom
##   (1621 observations deleted due to missingness)
## Multiple R-squared:  0.446,  Adjusted R-squared:  0.445 
## F-statistic:  647 on 14 and 11278 DF,  p-value: <2e-16
autoplot(model6) + theme_bw()

# Check for collinearity
vif(model6)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.52  4            1.05
## beds                 2.88  1            1.70
## bathrooms            1.54  1            1.24
## accommodates         2.83  1            1.68
## number_of_reviews    1.08  1            1.04
## review_scores_rating 1.10  1            1.05
## room_type            1.85  3            1.11
## host_is_superhost    1.15  1            1.07
## is_location_exact    1.00  1            1.00
# Summarise model estimates in dataframe
model6_estimates <- summary(model6)$coefficients[,1] 
model6_estimates <- exp(model6_estimates)
model6_estimates <- stack(model6_estimates)
model6_estimates <- model6_estimates %>% 
  select(ind, values)

model6_estimates

indvalues
(Intercept)2.09e+03
prop_type_simplifiedCondominium1.02    
prop_type_simplifiedHouse0.983   
prop_type_simplifiedLoft0.988   
prop_type_simplifiedOther1.09    
beds0.943   
bathrooms1.14    
accommodates1.11    
number_of_reviews1       
review_scores_rating1       
room_typeHotel room1       
room_typePrivate room0.489   
room_typeShared room0.296   
host_is_superhostTRUE1.09    
is_location_exactTRUE1.1     
We don’t find any high levels of collinearity in terms of VIF. We therefore don’t have to exclude any independent variables.

The effect of specifying the property’s exact location on prices is statistically significant: Ceteris paribus, staying at a place with exact location specified is associated with an increase in 4-nights rents of 9.6%%.

After controlling for other variables, is a listing’s exact location a significant predictor of price_4_nights?

For all cities, there are 3 variables that relate to neighbourhoods: neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed. There are typically more than 20 neighbourhoods in each city, and it wouldn’t make sense to include them all in the model. We would thus need to create a new categorical variabale neighbourhood_simplified and determine whether location is a predictor of price_4_nights

listings10 <- listings9 %>%
  
  # Create a new variables to regroup neighbourhoods
  mutate(neighbourhood_simplified = case_when(neighbourhood_cleansed %in% c("Cuauhtémoc",
                                                                       "Coyoacán", 
                                                                       "Miguel Hidalgo",
                                                                       "Benito Juárez", 
                                                                       "Tlalpan","
                                                                       Venustiano Carranza") ~ "Touristy", 
                                         neighbourhood_cleansed %in% c("Iztacalco",
                                                                       "Iztapalapa", 
                                                                       "Tláhuac",
                                                                       "Milpa Alta") ~ "Least Popular",
                                         TRUE ~ "Less Popular"))

As we could not break the boroughs down by geographical or administrative divisions, we decided to separate neighbourhoods by attractiveness to tourists: touristy areas, less popular areas, and unpopular areas (rough areas or those far from centre).

# Create regression
model7 <- lm(log(price_4_nights) ~ prop_type_simplified + beds + bathrooms + accommodates + number_of_reviews + review_scores_rating + room_type + host_is_superhost + is_location_exact + neighbourhood_simplified, 
             data = listings10)

summary(model7)
## 
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + beds + 
##     bathrooms + accommodates + number_of_reviews + review_scores_rating + 
##     room_type + host_is_superhost + is_location_exact + neighbourhood_simplified, 
##     data = listings10)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.783 -0.337 -0.029  0.308  2.530 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           7.092041   0.076339   92.90  < 2e-16 ***
## prop_type_simplifiedCondominium       0.022441   0.018758    1.20     0.23    
## prop_type_simplifiedHouse             0.024890   0.016470    1.51     0.13    
## prop_type_simplifiedLoft             -0.014752   0.019847   -0.74     0.46    
## prop_type_simplifiedOther             0.115113   0.019357    5.95  2.8e-09 ***
## beds                                 -0.054244   0.004729  -11.47  < 2e-16 ***
## bathrooms                             0.119558   0.006542   18.28  < 2e-16 ***
## accommodates                          0.103445   0.003917   26.41  < 2e-16 ***
## number_of_reviews                    -0.000600   0.000102   -5.90  3.8e-09 ***
## review_scores_rating                  0.003249   0.000729    4.46  8.4e-06 ***
## room_typeHotel room                  -0.025764   0.047095   -0.55     0.58    
## room_typePrivate room                -0.710377   0.012627  -56.26  < 2e-16 ***
## room_typeShared room                 -1.197298   0.061926  -19.33  < 2e-16 ***
## host_is_superhostTRUE                 0.081382   0.010416    7.81  6.1e-15 ***
## is_location_exactTRUE                 0.075717   0.013999    5.41  6.5e-08 ***
## neighbourhood_simplifiedLess Popular  0.383522   0.032268   11.89  < 2e-16 ***
## neighbourhood_simplifiedTouristy      0.600633   0.029453   20.39  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.516 on 11276 degrees of freedom
##   (1621 observations deleted due to missingness)
## Multiple R-squared:  0.472,  Adjusted R-squared:  0.471 
## F-statistic:  630 on 16 and 11276 DF,  p-value: <2e-16
autoplot(model7) + theme_bw()

# Check for collinearity
vif(model7)
##                          GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified     1.55  4            1.06
## beds                     2.89  1            1.70
## bathrooms                1.54  1            1.24
## accommodates             2.83  1            1.68
## number_of_reviews        1.09  1            1.04
## review_scores_rating     1.10  1            1.05
## room_type                1.85  3            1.11
## host_is_superhost        1.15  1            1.07
## is_location_exact        1.01  1            1.00
## neighbourhood_simplified 1.05  2            1.01
# Summarise model estimates in dataframe
model7_estimates <- summary(model7)$coefficients[,1] 
model7_estimates <- exp(model7_estimates)
model7_estimates <- stack(model7_estimates)
model7_estimates <- model7_estimates %>% 
  select(ind, values)

model7_estimates

indvalues
(Intercept)1.2e+03
prop_type_simplifiedCondominium1.02   
prop_type_simplifiedHouse1.03   
prop_type_simplifiedLoft0.985  
prop_type_simplifiedOther1.12   
beds0.947  
bathrooms1.13   
accommodates1.11   
number_of_reviews0.999  
review_scores_rating1      
room_typeHotel room0.975  
room_typePrivate room0.491  
room_typeShared room0.302  
host_is_superhostTRUE1.08   
is_location_exactTRUE1.08   
neighbourhood_simplifiedLess Popular1.47   
neighbourhood_simplifiedTouristy1.82   
We don’t find any high levels of collinearity in terms of VIF. We therefore don’t have to exclude any independent variables.

As expected, the type of neighbourhood is highly correlated with rents. Ceteris paribus, comparing to staying at Least Popular areas, staying at Touristy and Less Popular places is associated with an increase in 4-nights rents of 60.1% and 38.4% respectively.

What is the effect of cancellation_policy on price_4_nights, after we control for other variables?

model8 <- lm(log(price_4_nights) ~ prop_type_simplified + beds + bathrooms + accommodates + number_of_reviews + review_scores_rating + room_type + host_is_superhost + is_location_exact + neighbourhood_simplified + cancellation_policy, 
             data = listings10)

summary(model8)
## 
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + beds + 
##     bathrooms + accommodates + number_of_reviews + review_scores_rating + 
##     room_type + host_is_superhost + is_location_exact + neighbourhood_simplified + 
##     cancellation_policy, data = listings10)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.735 -0.335 -0.026  0.306  2.519 
## 
## Coefficients:
##                                                 Estimate Std. Error t value
## (Intercept)                                     7.059924   0.076081   92.80
## prop_type_simplifiedCondominium                 0.027316   0.018673    1.46
## prop_type_simplifiedHouse                       0.027065   0.016390    1.65
## prop_type_simplifiedLoft                       -0.012802   0.019754   -0.65
## prop_type_simplifiedOther                       0.123647   0.019281    6.41
## beds                                           -0.052164   0.004710  -11.08
## bathrooms                                       0.117414   0.006513   18.03
## accommodates                                    0.099958   0.003912   25.55
## number_of_reviews                              -0.000688   0.000102   -6.76
## review_scores_rating                            0.003367   0.000726    4.64
## room_typeHotel room                            -0.027584   0.046994   -0.59
## room_typePrivate room                          -0.702306   0.012612  -55.69
## room_typeShared room                           -1.186791   0.061657  -19.25
## host_is_superhostTRUE                           0.073386   0.010419    7.04
## is_location_exactTRUE                           0.077462   0.013935    5.56
## neighbourhood_simplifiedLess Popular            0.378474   0.032116   11.78
## neighbourhood_simplifiedTouristy                0.590477   0.029340   20.13
## cancellation_policymoderate                     0.040541   0.011391    3.56
## cancellation_policystrict_14_with_grace_period  0.114395   0.012701    9.01
## cancellation_policysuper_strict_30              0.951945   0.148801    6.40
## cancellation_policysuper_strict_60              0.051269   0.515010    0.10
##                                                Pr(>|t|)    
## (Intercept)                                     < 2e-16 ***
## prop_type_simplifiedCondominium                 0.14354    
## prop_type_simplifiedHouse                       0.09870 .  
## prop_type_simplifiedLoft                        0.51695    
## prop_type_simplifiedOther                       1.5e-10 ***
## beds                                            < 2e-16 ***
## bathrooms                                       < 2e-16 ***
## accommodates                                    < 2e-16 ***
## number_of_reviews                               1.5e-11 ***
## review_scores_rating                            3.5e-06 ***
## room_typeHotel room                             0.55724    
## room_typePrivate room                           < 2e-16 ***
## room_typeShared room                            < 2e-16 ***
## host_is_superhostTRUE                           2.0e-12 ***
## is_location_exactTRUE                           2.8e-08 ***
## neighbourhood_simplifiedLess Popular            < 2e-16 ***
## neighbourhood_simplifiedTouristy                < 2e-16 ***
## cancellation_policymoderate                     0.00037 ***
## cancellation_policystrict_14_with_grace_period  < 2e-16 ***
## cancellation_policysuper_strict_30              1.6e-10 ***
## cancellation_policysuper_strict_60              0.92070    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.513 on 11272 degrees of freedom
##   (1621 observations deleted due to missingness)
## Multiple R-squared:  0.478,  Adjusted R-squared:  0.477 
## F-statistic:  515 on 20 and 11272 DF,  p-value: <2e-16
autoplot(model8) + theme_bw()

# Check for collinearity
vif(model8)
##                          GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified     1.56  4            1.06
## beds                     2.89  1            1.70
## bathrooms                1.55  1            1.24
## accommodates             2.85  1            1.69
## number_of_reviews        1.10  1            1.05
## review_scores_rating     1.10  1            1.05
## room_type                1.88  3            1.11
## host_is_superhost        1.16  1            1.08
## is_location_exact        1.01  1            1.01
## neighbourhood_simplified 1.05  2            1.01
## cancellation_policy      1.09  4            1.01
# Summarise model estimates in dataframe
model8_estimates <- summary(model8)$coefficients[,1] 
model8_estimates <- exp(model8_estimates)
model8_estimates <- stack(model8_estimates)
model8_estimates <- model8_estimates %>% 
  select(ind, values)

model8_estimates

indvalues
(Intercept)1.16e+03
prop_type_simplifiedCondominium1.03    
prop_type_simplifiedHouse1.03    
prop_type_simplifiedLoft0.987   
prop_type_simplifiedOther1.13    
beds0.949   
bathrooms1.12    
accommodates1.11    
number_of_reviews0.999   
review_scores_rating1       
room_typeHotel room0.973   
room_typePrivate room0.495   
room_typeShared room0.305   
host_is_superhostTRUE1.08    
is_location_exactTRUE1.08    
neighbourhood_simplifiedLess Popular1.46    
neighbourhood_simplifiedTouristy1.8     
cancellation_policymoderate1.04    
cancellation_policystrict_14_with_grace_period1.12    
cancellation_policysuper_strict_302.59    
cancellation_policysuper_strict_601.05    
We don’t find any high levels of collinearity in terms of VIF. We therefore don’t have to exclude any independent variables.

The effects of cancellation policy on accommodation rents is statistically significant. Generally speaking, the stricter the cancellation policy, the higher the accommodation price. An exception is the cancellation policy “Super Strict 60 Days”, which is statistically indifferent from Flexible Cancellation policy. This may be because it is too extreme, and thus is only used by hosts of very small listings who need to maximise their earnings regardless of their customer service, or simply because there are not enough data points for this type of policy.

As for the other policies, moving from Flexible policy to Moderate, Strict (with grace period), and “Super Strict 30 Days” cancellation policy is expected to lead to a 4-nights price increase of 4.1%, 11.4%, and 95.1% respectively.

Summary Table

huxreg(
  
  # Choose models to include and set their names
  list("Property type and review stats" = model1,
            "Room type added" = model2,
            "Propery size added" = model4,
            "Superhost added" = model5,
            "Exact location availability added" = model6,
            "Neighborhood added" = model7,
            "Cancellation policy added" = model8),
  
  # Choose what statistics to display at the bottom
       statistics = c("#Observation" = "nobs",
                      "Adjusted R2" = "adj.r.squared",
                      "Residual SE" = "sigma"),
  
  # Bold coefficients that are statistically significant at 95% significance level
       bold_signif = 0.05) %>% 
  
  # Set table title
  set_caption("Comparison of models analysing 4-night Airbnb prices in Mexico City")
Comparison of models analysing 4-night Airbnb prices in Mexico City
Property type and review statsRoom type addedPropery size addedSuperhost addedExact location availability addedNeighborhood addedCancellation policy added
(Intercept)7.828 ***8.137 ***7.590 ***7.722 ***7.645 ***7.092 ***7.060 ***
(0.089)   (0.073)   (0.070)   (0.072)   (0.073)   (0.076)   (0.076)   
prop_type_simplifiedCondominium0.015    0.038    0.016    0.017    0.017    0.022    0.027    
(0.025)   (0.021)   (0.019)   (0.019)   (0.019)   (0.019)   (0.019)   
prop_type_simplifiedHouse-0.414 ***0.073 ***-0.016    -0.021    -0.017    0.025    0.027    
(0.020)   (0.018)   (0.017)   (0.017)   (0.017)   (0.016)   (0.016)   
prop_type_simplifiedLoft0.002    -0.143 ***-0.017    -0.011    -0.012    -0.015    -0.013    
(0.026)   (0.022)   (0.020)   (0.020)   (0.020)   (0.020)   (0.020)   
prop_type_simplifiedOther-0.110 ***0.090 ***0.085 ***0.088 ***0.089 ***0.115 ***0.124 ***
(0.024)   (0.021)   (0.020)   (0.020)   (0.020)   (0.019)   (0.019)   
number_of_reviews0.000 ***-0.000 ***-0.000 ** -0.000 ***-0.000 ***-0.001 ***-0.001 ***
(0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   
review_scores_rating0.005 ***0.004 ***0.005 ***0.003 ***0.003 ***0.003 ***0.003 ***
(0.001)   (0.001)   (0.001)   (0.001)   (0.001)   (0.001)   (0.001)   
room_typeHotel room        -0.093    0.001    0.007    0.005    -0.026    -0.028    
        (0.052)   (0.048)   (0.048)   (0.048)   (0.047)   (0.047)   
room_typePrivate room        -0.919 ***-0.725 ***-0.717 ***-0.716 ***-0.710 ***-0.702 ***
        (0.013)   (0.013)   (0.013)   (0.013)   (0.013)   (0.013)   
room_typeShared room        -1.215 ***-1.219 ***-1.212 ***-1.217 ***-1.197 ***-1.187 ***
        (0.067)   (0.064)   (0.064)   (0.063)   (0.062)   (0.062)   
beds                -0.060 ***-0.059 ***-0.059 ***-0.054 ***-0.052 ***
                (0.005)   (0.005)   (0.005)   (0.005)   (0.005)   
bathrooms                0.128 ***0.128 ***0.127 ***0.120 ***0.117 ***
                (0.007)   (0.007)   (0.007)   (0.007)   (0.007)   
accommodates                0.104 ***0.104 ***0.103 ***0.103 ***0.100 ***
                (0.004)   (0.004)   (0.004)   (0.004)   (0.004)   
host_is_superhostTRUE                        0.092 ***0.091 ***0.081 ***0.073 ***
                        (0.011)   (0.011)   (0.010)   (0.010)   
is_location_exactTRUE                                0.096 ***0.076 ***0.077 ***
                                (0.014)   (0.014)   (0.014)   
neighbourhood_simplifiedLess Popular                                        0.384 ***0.378 ***
                                        (0.032)   (0.032)   
neighbourhood_simplifiedTouristy                                        0.601 ***0.590 ***
                                        (0.029)   (0.029)   
cancellation_policymoderate                                                0.041 ***
                                                (0.011)   
cancellation_policystrict_14_with_grace_period                                                0.114 ***
                                                (0.013)   
cancellation_policysuper_strict_30                                                0.952 ***
                                                (0.149)   
cancellation_policysuper_strict_60                                                0.051    
                                                (0.515)   
#Observation11342        11342        11293        11293        11293        11293        11293        
Adjusted R20.042    0.358    0.439    0.443    0.445    0.471    0.477    
Residual SE0.694    0.568    0.531    0.530    0.528    0.516    0.513    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Prediction Using Model

Finally, we will use the best model we came up with for prediction. Suppose you are planning to visit the city over reading week, and you want to stay in an Airbnb. Let’s assume you want to find Airbnb’s that are apartment with a private room, have at least 10 reviews, and an average rating of at least 90. Then, we will use my best model to predict the total cost to stay at this Airbnb for 4 nights for you.

Summary:

Apartment Private Room at least 10 reviews average rating at least 90

Added assumption: 1 bed, flexible cancellation policy, 1 bathroom, accommodates 2, less popular neighbourhood, not superhost, location is exact

Model to use: model8

Rationale: it has the highest R-squared, and is most extensive

imarginary_stay <- 
  
  # Create new data frame with imaginary scenario
  tibble(prop_type_simplified = "Apartment", 
                          beds = 1,
                          bathrooms = 1,
                          accommodates = 2,
                          number_of_reviews = 10,
                          review_scores_rating = 90,
                          room_type = "Private room",
                          host_is_superhost = FALSE,
                          is_location_exact = TRUE,
                          neighbourhood_simplified = "Less Popular",
                          cancellation_policy = "flexible")

model_prediction <- 
  
# Save prediction as data frame so it can be mutated later
  data.frame(
    
    # Make prediction using model8 and imaginary data set
    predict(model8, 
            newdata = imarginary_stay, 
            
            # Set CI of prediction
            interval = "prediction")) %>%
  
  # convert from log to USD values
  mutate(price = exp(fit),
         CI_lower = exp(lwr),
         CI_upper = exp(upr)) %>%
  
  # Remove unwanted columns
  select(-fit, -lwr, -upr)

#Show result
model_prediction

priceCI_lowerCI_upper
1.6e+035834.37e+03
Based on the model, the price of the 4-night accommodation is expected to be USD1595, with the 95% confidence interval being a wide range from USD583 to uSD4365.